Projection Pursuit methodology makes it possible to solve the difficult problem of estimating a density defined on a set of very large dimension. In his seminal article, Huber (see "Projection pursuit", Annals of Statistics, 1985) demonstrates the interest of the Projection Pursuit method thanks to the factorisation of a density into a Gaussian component and some residual density in a context of Kullback-Leibler divergence maximisation.

In the present article, we introduce a new algorithm, and in particular a test for the factorisation of a density estimated from an i.i.d. sample.
1. Outline of the article

Projection Pursuit aims at creating one or several projections delivering a maximum of information on the structure of a data set irrespective of its size. Once a structure has been evidenced, the corresponding data are transformed through a Gaussianization. This process is repeated recursively in order to determine another structure in the remaining data, until no further structure can be highlighted. This kind of approach for isolating structures was first studied by Friedman [Frie84] and Huber [HUB85]. Each of them details, through two different methodologies, how to isolate such a structure and therefore how to estimate the density of the corresponding data. However, Mu Zhu [ZMU04] showed that the two methodologies described by each of the above authors are in fact not equivalent when the number of iterations in the algorithms exceeds the dimension of the space containing the data. We will consequently concentrate on Huber's study only, while taking into account Mu Zhu's input.

After providing a brief overview of Huber's methodologies, we will present our approach and objective.

1.1. Huber's analytic approach 

A density $f$ on $\mathbb{R}^d$ is considered. We then define an instrumental density $g$ with the same mean and variance as $f$. According to Huber's approach, we first carry out the $K(f,g) = 0$ test, with $K$ being the relative entropy (also called the Kullback-Leibler divergence). If the test is passed, then $f = g$ and the algorithm stops. If the test is not passed, based on the maximisation of $a \mapsto K(f_a, g_a)$ since $K(f,g) = K(f_a, g_a) + K(f\frac{g_a}{f_a}, g)$ and assuming that $K(f,g)$ is finite, Huber's methodology requires as a first step to define a vector $a_1$ and a density $f^{(1)}$ with

$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} K(f\frac{g_a}{f_a}, g)$ and $f^{(1)} = f\frac{g_{a_1}}{f_{a_1}}$, (1)

where $\mathbb{R}^d_*$ is the set of non null vectors of $\mathbb{R}^d$ and $f_a$ (resp. $g_a$) represents the density of $a^\top X$ (resp. $a^\top Y$) when $f$ (resp. $g$) is the density of $X$ (resp. $Y$).

As a second step, Huber's algorithm replaces $f$ with $f^{(1)}$ and repeats the first step.

Finally, a sequence $(a_1, a_2, \ldots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $f^{(i)}$ are derived from the iterations of this process.

Remark 1

The algorithm enables us to generate a product approximation and even a product representation of $f$. Indeed, two rules can trigger the end of the process. The first one is the nullity of the relative entropy and the second one is the process reaching the $d$-th iteration. When these two rules are satisfied, the algorithm produces a product approximation of $f$. When only the first rule is satisfied, the algorithm generates a product representation of $f$.

Mathematically, for any integer $j$ such that $K(f^{(j)}, g) = 0$ with $j \le d$, the process infers $f^{(j)} = g$, i.e. $f = g\prod_{i=1}^{j} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$, since by induction $f^{(j)} = f\prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. Likewise, when $K(f^{(j)}, g) > 0$ for all $j \le d$, it is assumed that $g = f^{(d)}$ in order to obtain $f = g\prod_{i=1}^{d} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$, i.e. we approximate $f$ with the product $g\prod_{i=1}^{d} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$.

Even if the condition $j \le d$ is not met, the algorithm can also stop if the Kullback-Leibler divergence equals zero. Therefore, since by induction we have $f^{(j)} = f\prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$ with $f^{(0)} = f$, we infer $g = f\prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. We can thus represent $f$ as $f = g\prod_{i=1}^{j} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$.

Finally, we remark that the algorithm implies that the sequence $(K(f^{(j)}, g))_j$ is decreasing and non negative with $f^{(0)} = f$.

1.2. Huber's synthetic approach 

Maintaining the notations of the above section, we begin with performing the $K(f,g) = 0$ test. If the test is passed, then $f = g$ and the algorithm stops. Otherwise, based on the maximisation of $a \mapsto K(f_a, g_a)$ since $K(f,g) = K(f_a, g_a) + K(f, g\frac{f_a}{g_a})$ and assuming that $K(f,g)$ is finite, Huber's methodology requires as a first step to define a vector $a_1$ and a density $g^{(1)}$ with

$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} K(f, g\frac{f_a}{g_a})$ and $g^{(1)} = g\frac{f_{a_1}}{g_{a_1}}$. (2)

As a second step, Huber's algorithm replaces $g$ with $g^{(1)}$ and repeats the first step.

Finally, a sequence $(a_1, a_2, \ldots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$ are derived from the iterations of this process.
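
The decomposition used in this first step can be checked directly from the definition of $K$. The short derivation below is a sketch added for the reader, assuming that all the integrals involved are finite; the variable $t$ stands for the projected coordinate $a^\top x$ and is introduced here for illustration only.

```latex
\begin{align*}
K\Big(f,\; g\frac{f_a}{g_a}\Big)
  &= \int f(x)\,\ln\frac{f(x)\,g_a(a^\top x)}{g(x)\,f_a(a^\top x)}\,dx \\
  &= \int f(x)\,\ln\frac{f(x)}{g(x)}\,dx
     \;-\; \int f(x)\,\ln\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx \\
  &= K(f,g) \;-\; \int f_a(t)\,\ln\frac{f_a(t)}{g_a(t)}\,dt \\
  &= K(f,g) \;-\; K(f_a,g_a).
\end{align*}
```

The third line uses the fact that $f_a$ is the density of $a^\top X$ when $X$ has density $f$. Since $K(f,g)$ does not depend on $a$, minimising $a \mapsto K(f, g\frac{f_a}{g_a})$ amounts to maximising $a \mapsto K(f_a, g_a)$.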

Remark 2

As in the analytic approach, this methodology allows us to generate a product approximation and even a product representation of $f$ from $g$. Moreover, it also offers the same end of process rules. In other words, if we have $K(f, g^{(j)}) > 0$ for any $j$ such that $j \le d$, then $f$ is approximated with $g^{(d)}$. And if there exists $j$ such that $K(f, g^{(j)}) = 0$, then it holds $g^{(j)} = f$, i.e. $f$ is represented by $g^{(j)}$. In this case, since by induction we have $g^{(j)} = g\prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)} = g$, it holds $f = g\prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$.

Finally, we note that the algorithm implies that the sequence $(K(f, g^{(j)}))_j$ is decreasing and non negative with $g^{(0)} = g$.



Finally, in [ZMU04], Mu Zhu shows that, beyond $d$ iterations, the data processing of these methodologies evidences significant differences, i.e. that past $d$ iterations, the two methodologies are no longer equivalent. We will therefore only consider Huber's synthetic approach since $g$ is known and since we want to find a representation of $f$.

1.3. Proposal 

We begin with performing the $K(f,g) = 0$ test. Should this test be passed, then $f = g$ and the algorithm stops. Otherwise, the first step of our algorithm consists in defining a vector $a_1$ and a density $g^{(1)}$ by

$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} K(g\frac{f_a}{g_a}, f)$ and $g^{(1)} = g\frac{f_{a_1}}{g_{a_1}}$. (3)

In the second step, we replace $g$ with $g^{(1)}$ and we repeat the first step. We thus derive, from the iterations of this process, a sequence $(a_1, a_2, \ldots)$ of vectors in $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$. We will prove that $a_1$ simultaneously optimises (1), (2) and (3). We will also prove that the underlying structures of $f$ evidenced through this method are identical to the ones obtained through Huber's methods.

Remark 3

As in Huber's algorithms, we perform a product approximation and even a product representation of $f$.

In the case where, at each of the first $d$ steps, the relative entropy is positive, we approximate $f$ with $g^{(d)}$.

In the case where there exists a step of the algorithm such that the Kullback-Leibler divergence equals zero, then, calling $j$ this step, we represent $f$ with $g^{(j)}$. In other words, if there exists a positive integer $j$ such that $K(g^{(j)}, f) = 0$, then, since by induction we have $g^{(j)} = g\prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)} = g$, we represent $f$ with the product $g\prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$.

We also remark that the algorithm implies that the sequence $(K(g^{(j)}, f))_j$ is decreasing and non negative with $g^{(0)} = g$.

Finally, the very form of the relationship (3) demonstrates that we deal with M-estimation. We can consequently state that our method is more robust than Huber's - see [YOHAI], [TOMA] as well as [HUBER].

Example 1 Let $f$ be a density defined on $\mathbb{R}^{10}$ by $f(x_1, \ldots, x_{10}) = \eta(x_2, \ldots, x_{10})\,\xi(x_1)$, with $\eta$ being a multivariate Gaussian density on $\mathbb{R}^9$, and $\xi$ being a non Gaussian density.
Let us also consider $g$, a multivariate Gaussian density with the same mean and variance as $f$.
Since $g(x_2, \ldots, x_{10}/x_1) = \eta(x_2, \ldots, x_{10})$, we have $K(g\frac{f_1}{g_1}, f) = K(\eta\,f_1, f) = K(f, f) = 0$ as $f_1 = \xi$, where $f_1$ and $g_1$ are the first marginal densities of $f$ and $g$ respectively. Hence, the non negative function $a \mapsto K(g\frac{f_a}{g_a}, f)$ reaches zero for $e_1 = (1, 0, \ldots, 0)'$.
We therefore infer that $g(x_2, \ldots, x_{10}/x_1) = f(x_2, \ldots, x_{10}/x_1)$.
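
The mechanism of this example can be checked numerically in dimension 2 instead of 10. The following is a minimal sketch under stated assumptions, not the program used in the simulations of this article: the non Gaussian factor is taken to be a symmetric two-component normal mixture, the divergences are approximated by Riemann sums on a grid, and the names `normal_pdf`, `kl_on_grid` and `example_divergences` are hypothetical.

```python
import numpy as np


def normal_pdf(t, mean, var):
    """Density of the univariate normal N(mean, var)."""
    return np.exp(-((t - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def kl_on_grid(p, q, cell):
    """Riemann-sum approximation of K(p, q) = int p ln(p / q)."""
    mask = p > 0.0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * cell)


def example_divergences(num_points=401, half_width=8.0):
    """Return (K(g, f), K(g f_1 / g_1, f)) for a two-dimensional toy density."""
    grid = np.linspace(-half_width, half_width, num_points)
    cell = (grid[1] - grid[0]) ** 2
    x1, x2 = np.meshgrid(grid, grid, indexing="ij")

    # Assumed non Gaussian factor: mixture of N(-1.5, 0.25) and N(1.5, 0.25),
    # with mean 0 and variance 0.25 + 1.5**2.
    xi = 0.5 * normal_pdf(x1, -1.5, 0.25) + 0.5 * normal_pdf(x1, 1.5, 0.25)
    xi_var = 0.25 + 1.5 ** 2

    eta = normal_pdf(x2, 0.0, 1.0)       # Gaussian factor
    f = xi * eta                         # f(x1, x2) = xi(x1) * eta(x2)
    g1 = normal_pdf(x1, 0.0, xi_var)     # first marginal of g
    g = g1 * eta                         # Gaussian with the moments of f
    first_step = g * xi / g1             # g * f_1 / g_1, since f_1 = xi
    return kl_on_grid(g, f, cell), kl_on_grid(first_step, f, cell)
```

With these choices, the first returned value is clearly positive while the second vanishes up to rounding errors, which is the two-dimensional counterpart of $K(g\frac{f_1}{g_1}, f) = 0$.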

To recapitulate our method, if $K(g, f) = 0$, we derive $f$ from the relationship $f = g$; should a sequence $(a_i)_{i=1,\ldots,j}$, $j < d$, of vectors in $\mathbb{R}^d_*$ defining $g^{(j)}$ and such that $K(g^{(j)}, f) = 0$ exist, then $f(./a_i^\top x, 1 \le i \le j) = g(./a_i^\top x, 1 \le i \le j)$, i.e. $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$ - see also section 2.1.2 for details.



In this paper, after having clarified the choice of $g$, we will consider the statistical solution to the representation problem, assuming that $f$ is unknown and $X_1, X_2, \ldots, X_m$ are i.i.d. with density $f$. We will provide asymptotic results pertaining to the family of optimizing vectors $a_{k,m}$ - that we will define more precisely below - as $m$ goes to infinity. Our results also prove that the empirical representation scheme converges towards the theoretical one. Finally, we will compare Huber's optimisation methods with ours and we will present simulations.

2. The algorithm 
2.1. The model 

As described by Friedman [Frie84] and Diaconis [DIAFREE84], the choice of $g$ depends on the family of distributions one wants to find in $f$. Until now, the choice has only been to use the class of Gaussian distributions. This can also be extended to the class of elliptical distributions.

2.1.1. Elliptical distributions

The fact that conditional densities with elliptical distributions are also elliptical - see [CAMBANIS81], [LANDS03] - enables us to use this class in our algorithm - and in Huber's algorithms.

Definition 1 $X$ is said to abide by a multivariate elliptical distribution, denoted $X \sim E_d(\mu, \Sigma, \xi_d)$, if $X$ has the following density, for any $x$ in $\mathbb{R}^d$: $f_X(x) = \frac{c_d}{|\Sigma|^{1/2}}\,\xi_d\big(\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\big)$,
where $\Sigma$ is a $d \times d$ positive-definite matrix and where $\mu$ is a $d$-column vector,
where $\xi_d$ is referred to as the "density generator",
where $c_d$ is a normalisation constant, such that $c_d = \frac{\Gamma(d/2)}{(2\pi)^{d/2}}\big(\int_0^\infty x^{d/2-1}\xi_d(x)\,dx\big)^{-1}$,
with $\int_0^\infty x^{d/2-1}\xi_d(x)\,dx < \infty$.

Property 1 1/ For any $X \sim E_d(\mu, \Sigma, \xi_d)$, for any $m \times d$ matrix $A$ with rank $m \le d$, and for any $m$-dimensional vector $b$, we have $AX + b \sim E_m(A\mu + b, A\Sigma A', \xi_m)$.

Any marginal density of a multivariate elliptical distribution is consequently elliptical, i.e. $X = (X_1, X_2, \ldots, X_d) \sim E_d(\mu, \Sigma, \xi_d)$ implies that $X_i \sim E_1(\mu_i, \sigma_i^2, \xi_1)$ with $f_{X_i}(x) = \frac{c_1}{\sigma_i}\,\xi_1\big(\frac{1}{2}(\frac{x - \mu_i}{\sigma_i})^2\big)$, $1 \le i \le d$.

2/ Corollary 5 of [CAMBANIS81] states that the conditional densities with elliptical distributions are also elliptical. Indeed, if $X = (X_1, X_2)' \sim E_d(\mu, \Sigma, \xi_d)$, with $X_1$ (resp. $X_2$) of size $d_1 < d$ (resp. $d_2 < d$), then $X_1/(X_2 = a) \sim E_{d_1}(\mu', \Sigma', \xi_{d_1})$ with $\mu' = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a - \mu_2)$ and $\Sigma' = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, with $\mu = (\mu_1, \mu_2)$ and $\Sigma = (\Sigma_{ij})_{1 \le i,j \le 2}$.
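
The conditional parameters $\mu'$ and $\Sigma'$ of the second part of the property are straightforward to compute. The following sketch is an illustration only; the function name `conditional_parameters` and its argument layout (the first `d1` coordinates form $X_1$) are assumptions of this example.

```python
import numpy as np


def conditional_parameters(mu, sigma, d1, a):
    """Location and scale matrix of X_1 given X_2 = a.

    X_1 gathers the first d1 coordinates and X_2 the remaining ones.
    Returns (mu', Sigma') with
        mu'    = mu_1 + Sigma_12 Sigma_22^{-1} (a - mu_2)
        Sigma' = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    mu1, mu2 = mu[:d1], mu[d1:]
    s11, s12 = sigma[:d1, :d1], sigma[:d1, d1:]
    s21, s22 = sigma[d1:, :d1], sigma[d1:, d1:]
    gain = s12 @ np.linalg.inv(s22)
    return mu1 + gain @ (a - mu2), s11 - gain @ s21
```

For instance, with $\mu = (0, 0)$, $\Sigma_{11} = 2$, $\Sigma_{12} = \Sigma_{21} = \Sigma_{22} = 1$ and $a = 1$, the formulas give $\mu' = 1$ and $\Sigma' = 1$.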

Remark 4 In [LANDS03], the authors show that the multivariate Gaussian distribution derives from $\xi_d(x) = e^{-x}$. They also show that if $X = (X_1, \ldots, X_d)$ has an elliptical density such that its marginals meet $E(X_i) < \infty$ and $E(X_i^2) < \infty$ for $1 \le i \le d$, then $\mu$ is the mean of $X$ and $\Sigma$ is a multiple of the covariance matrix of $X$. From now on, we will therefore assume this is the case.
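
As a sanity check of Definition 1 and of the first statement of this remark, the elliptical density can be evaluated numerically and compared with the usual Gaussian density when the generator is $x \mapsto e^{-x}$. This is a hedged sketch: the helper names, the truncation bound `upper` and the trapezoidal quadrature are choices made for this illustration and do not come from the article.

```python
import math

import numpy as np


def generator_integral(generator, d, upper=40.0, num_points=200001):
    """Approximate int_0^inf x^(d/2 - 1) * generator(x) dx.

    The substitution x = u**2 removes the singularity at 0 when d = 1:
    the integral becomes int_0^inf 2 * u**(d - 1) * generator(u**2) du,
    truncated here at u = sqrt(upper) (an assumption of this sketch).
    """
    u = np.linspace(0.0, math.sqrt(upper), num_points)
    values = 2.0 * u ** (d - 1) * generator(u ** 2)
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(u)))


def elliptical_density(x, mu, sigma, generator):
    """Density of E_d(mu, sigma, generator) at the point x, as in Definition 1."""
    d = len(mu)
    c_d = (math.gamma(d / 2.0) / (2.0 * math.pi) ** (d / 2.0)
           / generator_integral(generator, d))
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    quad = float(diff @ np.linalg.solve(sigma, diff))
    return c_d / math.sqrt(np.linalg.det(sigma)) * generator(0.5 * quad)
```

With `generator = lambda t: np.exp(-t)`, the integral equals $\Gamma(d/2)$, so that $c_d = (2\pi)^{-d/2}$ and the Gaussian density is recovered.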

Definition 2 Let $t$ be an elliptical density on $\mathbb{R}^k$ and let $q$ be an elliptical density on $\mathbb{R}^{k'}$. The elliptical densities $t$ and $q$ are said to be part of the same family of elliptical densities, if their generating densities are $\xi_k$ and $\xi_{k'}$ respectively, which belong to a common given family of densities.

Example 2 Consider two Gaussian densities $\mathcal{N}(0, 1)$ and $\mathcal{N}((0, 0), Id_2)$. They are said to belong to the same elliptical family as they both present $x \mapsto e^{-x}$ as generating density.



2.1.2. Choice of g 

Let $f$ be a density on $\mathbb{R}^d$. We assume there exist $d$ non null linearly independent vectors $a_j$, with $1 \le j \le d$, of $\mathbb{R}^d$, such that

$f(x) = n(a_{j+1}^\top x, \ldots, a_d^\top x)\,h(a_1^\top x, \ldots, a_j^\top x)$, (4)

with $j < d$, $n$ being an elliptical density on $\mathbb{R}^{d-j}$ and with $h$ being a density on $\mathbb{R}^j$, which does not belong to the same family as $n$. Let $X = (X_1, \ldots, X_d)$ be a vector with $f$ as density.

We define $g$ as an elliptical distribution with the same mean and variance as $f$.

For simplicity, let us assume that the family $\{a_j\}_{1 \le j \le d}$ is the canonical basis of $\mathbb{R}^d$.

The very definition of $f$ implies that $(X_{j+1}, \ldots, X_d)$ is independent from $(X_1, \ldots, X_j)$. Hence, property 1 allows us to infer that the density of $(X_{j+1}, \ldots, X_d)$ given $(X_1, \ldots, X_j)$ is $n$.

Let us assume that $K(g^{(j)}, f) = 0$, for some $j \le d$. We then get $\frac{f(x)}{f_{a_1} f_{a_2} \cdots f_{a_j}} = \frac{g(x)}{g^{(0)}_{a_1} g^{(1)}_{a_2} \cdots g^{(j-1)}_{a_j}}$, since, by induction, we have $g^{(j)}(x) = g(x)\frac{f_{a_1}}{g^{(0)}_{a_1}}\frac{f_{a_2}}{g^{(1)}_{a_2}} \cdots \frac{f_{a_j}}{g^{(j-1)}_{a_j}}$. Consequently, the fact that the conditional densities with elliptical distributions are also elliptical, as well as the above relationship, enable us to state that $n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_i^\top x, 1 \le i \le j) = g(./a_i^\top x, 1 \le i \le j)$. In other words, $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$.

Now, if the family $\{a_j\}_{1 \le j \le d}$ is no longer the canonical basis of $\mathbb{R}^d$, then this family is again a basis of $\mathbb{R}^d$. Hence, lemma 11 implies that

$g(./a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_1^\top x, \ldots, a_j^\top x)$, (5)

which is equivalent to $K(g^{(j)}, f) = 0$, since by induction $g^{(j)} = g\prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$.

The end of our algorithm implies that $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$. Therefore, the nullity of the Kullback-Leibler divergence provides us with information on the density structure. In summary, the following proposition clarifies the choice of $g$, which depends on the family of distributions one wants to find in $f$:

Proposition 1 With the above notations, $K(g^{(j)}, f) = 0$ is equivalent to
$g(./a_1^\top x, \ldots, a_j^\top x) = f(./a_1^\top x, \ldots, a_j^\top x)$.

More generally, the above proposition leads us to define the co-support of $f$ as the vector space generated by the vectors $a_1, \ldots, a_j$.

Definition 3 Let $f$ be a density on $\mathbb{R}^d$. We define the co-vectors of $f$ as the sequence of vectors $a_1, \ldots, a_j$ which solves the problem $K(g^{(j)}, f) = 0$ where $g$ is an elliptical distribution with the same mean and variance as $f$. We define the co-support of $f$ as the vector space generated by the vectors $a_1, \ldots, a_j$.

2.2. Stochastic outline of the algorithm 

Let $X_1, X_2, \ldots, X_m$ (resp. $Y_1, Y_2, \ldots, Y_m$) be a sequence of $m$ independent random vectors with the same density $f$ (resp. $g$). As customary in nonparametric Kullback-Leibler optimizations, all estimates of $f$ and $f_a$, as well as all uses of Monte Carlo methods, are performed using subsamples $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, extracted respectively from $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_m$, since the estimates are bounded below by some positive deterministic sequence $\theta_m$ (see Appendix B).

Let $P_n$ be the empirical measure based on the subsample $X_1, X_2, \ldots, X_n$. Let $f_n$ (resp. $f_{a,n}$ for any $a$ in $\mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $f_a$), which is built from $X_1, X_2, \ldots, X_n$ (resp. $a^\top X_1, a^\top X_2, \ldots, a^\top X_n$).

As defined in section 1.3, we introduce the following sequences $(a_k)_{k \ge 1}$ and $(g^{(k)})_{k \ge 1}$:

• $a_k$ is a non null vector of $\mathbb{R}^d$ such that $a_k = \arg\min_{a \in \mathbb{R}^d_*} K(g^{(k-1)}\frac{f_a}{g^{(k-1)}_a}, f)$,

• $g^{(k)}$ is the density such that $g^{(k)} = g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}}$ with $g^{(0)} = g$.

The stochastic setting up of the algorithm uses $f_n$ and $g_n^{(0)} = g$ instead of $f$ and $g^{(0)} = g$, since $g$ is known. Thus, at the first step, we build the vector $\check a_1$ which minimizes the Kullback-Leibler divergence between $f_n$ and $g\frac{f_{a,n}}{g_a}$ and which estimates $a_1$.

Proposition 10 and lemma 12 enable us to minimize the Kullback-Leibler divergence between $f_n$ and $g\frac{f_{a,n}}{g_a}$. Defining $\check a_1$ as the argument of this minimization, proposition 3 shows us that this vector tends to $a_1$.

Finally, we define the density $\check g_n^{(1)}$ as $\check g_n^{(1)} = g\frac{f_{\check a_1,n}}{g_{\check a_1}}$, which estimates $g^{(1)}$ through theorem 1.

Now, from the second step and as defined in section 1.3, the density $g^{(k-1)}$ is unknown. Once again, we therefore have to truncate the samples.

All estimates of $f$ and $f_a$ (resp. $g^{(1)}$ and $g^{(1)}_a$) are performed using a subsample $X_1, X_2, \ldots, X_n$ (resp. $Y^{(1)}_1, Y^{(1)}_2, \ldots, Y^{(1)}_n$) extracted from $X_1, X_2, \ldots, X_m$ (resp. $Y^{(1)}_1, Y^{(1)}_2, \ldots, Y^{(1)}_m$, which is a sequence of $m$ independent random vectors with the same density $g^{(1)}$) such that the estimates are bounded below by some positive deterministic sequence $\theta_m$ (see Appendix B).

Let $P_n$ be the empirical measure based on the subsample $X_1, X_2, \ldots, X_n$. Let $f_n$ (resp. $g_n^{(1)}$, $f_{a,n}$, $g^{(1)}_{a,n}$ for any $a$ in $\mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $g^{(1)}$, $f_a$, $g^{(1)}_a$) which is built from $X_1, X_2, \ldots, X_n$ (resp. $Y^{(1)}_1, Y^{(1)}_2, \ldots, Y^{(1)}_n$ and the corresponding projected subsamples). The stochastic setting up of the algorithm uses $f_n$ and $g_n^{(1)}$ instead of $f$ and $g^{(1)}$. Thus, we build the vector $\check a_2$ which minimizes the Kullback-Leibler divergence between $f_n$ and $g_n^{(1)}\frac{f_{a,n}}{g^{(1)}_{a,n}}$ - since $g^{(1)}$ and $g^{(1)}_a$ are unknown - and which estimates $a_2$. Proposition 10 and lemma 12 enable us to minimize the Kullback-Leibler divergence between $f_n$ and $g_n^{(1)}\frac{f_{a,n}}{g^{(1)}_{a,n}}$. Defining $\check a_2$ as the argument of this minimization, the convergence results of section 3.1.3 show that this vector tends to $a_2$ in $n$. Finally, we define the density $\check g_n^{(2)}$ as $\check g_n^{(2)} = g_n^{(1)}\frac{f_{\check a_2,n}}{g^{(1)}_{\check a_2,n}}$, which estimates $g^{(2)}$ through theorem 1.

And so on, we end up obtaining a sequence $(\check a_1, \check a_2, \ldots)$ of vectors in $\mathbb{R}^d_*$ estimating the co-vectors of $f$ and a sequence of densities $(\check g_n^{(k)})_k$ such that $\check g_n^{(k)}$ estimates $g^{(k)}$ through theorem 1.
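
The density update of the first stochastic step can be sketched as follows for a Gaussian $g$ and a given direction $a$. This is an illustration under stated assumptions rather than the procedure of the article: the truncation of the samples described in Appendix B is not reproduced, the Gaussian kernel and Silverman's rule of thumb for the bandwidth are choices made here, the direction $a$ is supplied by the caller instead of being obtained by minimization, and all function names are hypothetical.

```python
import numpy as np


def normal_pdf(t, mean, var):
    """Density of the univariate normal N(mean, var)."""
    return np.exp(-((t - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def gaussian_pdf(x, mean, cov):
    """Density of the multivariate normal N(mean, cov) at the rows of x."""
    diff = x - mean
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    d = len(mean)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))


def kernel_estimate(points, sample, bandwidth):
    """Gaussian kernel estimate of a univariate density, evaluated at `points`."""
    z = (points[:, None] - sample[None, :]) / bandwidth
    norm = len(sample) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def first_step_estimate(x, sample, a, mean, cov):
    """Evaluate g(x) * f_{a,n}(a'x) / g_a(a'x) at the rows of x.

    g is the Gaussian density N(mean, cov); f_{a,n} is the kernel estimate
    built from the projected sample a'X_1, ..., a'X_n.
    """
    a = np.asarray(a, dtype=float)
    projected = sample @ a
    # Silverman's rule of thumb: an assumption of this sketch.
    bandwidth = 1.06 * projected.std(ddof=1) * len(projected) ** (-0.2)
    t = x @ a
    f_an = kernel_estimate(t, projected, bandwidth)
    g_a = normal_pdf(t, a @ mean, a @ cov @ a)  # density of a'Y when Y ~ g
    return gaussian_pdf(x, mean, cov) * f_an / g_a
```

Since $g_a$ is the exact marginal of $g$ along $a$ and $f_{a,n}$ integrates to one, the returned function is itself a density, which is the property noted for the estimates of section 3.1.2.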

3. Results 

3.1. Convergence results 
3.1.1. Hypotheses on f

In this paragraph, we define the set of hypotheses on $f$ which can possibly be used in our work. Discussion on several of these hypotheses can be found in Appendix D. In this section, to be more legible, we replace $g$ with $g^{(k-1)}$. Let $\Theta = \mathbb{R}^d$,
$M(b, a, x) = \int \ln\big(\frac{g(x) f_b(b^\top x)}{g_b(b^\top x) f(x)}\big)\,g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx - \big(\frac{g(x) f_b(b^\top x)}{g_b(b^\top x) f(x)} - 1\big)$,
$P_n M(b, a) = \int M(b, a, x)\,dP_n$, $P M(b, a) = \int M(b, a, x) f(x)\,dx$,
$P$ being the probability measure of $f$. Similarly as in chapter V of [VDW], we define:
(H'1): For all $\varepsilon > 0$, there is $\eta > 0$, such that for all $c \in \Theta$ verifying $\|c - a_k\| \ge \varepsilon$, we have $PM(c, a) < PM(a_k, a) - \eta$, with $a \in \Theta$.
(H'2): There exists a neighborhood of $a_k$, $V$, and a positive function $H$, such that, for all $c \in V$, we have $|M(c, a_k, x)| \le H(x)$ ($P$-a.s.) with $PH < \infty$.
(H'3): There exists a neighborhood of $a_k$, $V$, such that for all $\varepsilon$, there exists an $\eta$ such that for all $c \in V$ and $a \in \Theta$ verifying $\|a - a_k\| \ge \varepsilon$, we have $PM(c, a_k) < PM(c, a) - \eta$.
Putting $I_{a_k} = \frac{\partial^2}{\partial a^2} K(g\frac{f_{a_k}}{g_{a_k}}, f)$ and $x \mapsto \rho(b, a, x) = \ln\big(\frac{g(x) f_b(b^\top x)}{g_b(b^\top x) f(x)}\big)\,g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}$, we now consider:
(H'4): There exists a neighborhood of $(a_k, a_k)$, $V_k$, such that, for all $(b, a)$ of $V_k$, the gradient $\nabla\big(\frac{g(x) f_a(a^\top x)}{g_a(a^\top x)}\big)$ and the Hessian $\mathcal{H}\big(\frac{g(x) f_a(a^\top x)}{g_a(a^\top x)}\big)$ exist ($\lambda$-a.s.), and the first order partial derivative of $\frac{g(x) f_a(a^\top x)}{g_a(a^\top x)}$ and the first and second order derivatives of $(b, a) \mapsto \rho(b, a, x)$ are dominated ($\lambda$-a.s.) by integrable functions.
(H'5): The function $(b, a) \mapsto M(b, a, x)$ is $C^3$ in a neighborhood $V'_k$ of $(a_k, a_k)$ for all $x$, and all the partial derivatives of order 3 of $(b, a) \mapsto M(b, a, x)$ are dominated in $V'_k$ by a $P$-integrable function $H(x)$.
(H'6): $P\|\frac{\partial}{\partial b} M(a_k, a_k)\|^2$ and $P\|\frac{\partial}{\partial a} M(a_k, a_k)\|^2$ are finite and the expressions $P\frac{\partial^2}{\partial b_i \partial b_j} M(a_k, a_k)$ and $I_{a_k}$ exist and are invertible.
(H'7): There exists $k$ such that $PM(a_k, a_k) = 0$.
(H'8): $(Var_P(M(a_k, a_k)))^{1/2}$ exists and is invertible.
(H'0): $f$ and $g$ are assumed to be positive and bounded and such that $K(g, f) \ge \int |f(x) - g(x)|\,dx$.

3.1.2. Estimation of the first co-vector of f 

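Let $\mathcal{R}$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $g(x) r(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(gr, f)$ in $r$:

Proposition 2 There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r \in \mathcal{R}} K(gr, f) = \frac{f_a}{g_a}$ and $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$.

Following [BROKEZ], let us introduce the estimate of $K(g\frac{f_{a,n}}{g_a}, f_n)$, through

$\check K(g\frac{f_{a,n}}{g_a}, f_n) = \int M(a, a, x)\,dP_n(x)$.

Proposition 3 Let $\check a := \arg\inf_{a \in \mathbb{R}^d_*} \check K(g\frac{f_{a,n}}{g_a}, f_n)$.
Then, $\check a$ is a strongly convergent estimate of $a$, as defined in proposition 2.

Let us also introduce the following sequences $(\check a_k)_{k \ge 1}$ and $(\check g_n^{(k)})_{k \ge 1}$, for any given $n$ - see section 2.2:

• $\check a_k$ is an estimate of $a_k$ as defined in proposition 3 with $\check g_n^{(k-1)}$ instead of $g$,

• $\check g_n^{(k)}$ is such that $\check g_n^{(0)} = g$ and $\check g_n^{(k)}(x) = \check g_n^{(k-1)}(x)\frac{f_{\check a_k,n}(\check a_k^\top x)}{[\check g^{(k-1)}]_{\check a_k,n}(\check a_k^\top x)}$, i.e. $\check g_n^{(k)}(x) = g(x)\prod_{j=1}^{k} \frac{f_{\check a_j,n}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j,n}(\check a_j^\top x)}$.

We also note that $\check g_n^{(k)}$ is a density.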

3.1.3. Convergence study at the k-th step of the algorithm

In this paragraph, we show that the sequence $(\check a_k)_n$ converges towards $a_k$ and that the sequence $(\check g_n^{(k)})_n$ converges towards $g^{(k)}$.

Let $\check c_n(a) = \arg\sup_{c \in \Theta} P_n M(c, a)$, with $a \in \Theta$, and $\check\gamma_n = \arg\inf_{a \in \Theta}\sup_{c \in \Theta} P_n M(c, a)$. We state

Proposition 4 $\sup_{a \in \Theta} \|\check c_n(a) - a_k\|$ converges toward 0 and $\check\gamma_n$ converges toward $a_k$, a.s.

Finally, the following theorem shows that $\check g_n^{(k)}$ converges almost everywhere towards $g^{(k)}$.

Theorem 1 It holds $\check g_n^{(k)} \to_n g^{(k)}$ a.s.



3.2. Asymptotic inference at the k-th step of the algorithm

The following theorem gives the rate at which $\check g_n^{(k)}$ converges towards $g^{(k)}$ in three different cases, namely for any given $x$, with the $L^1$ distance and with the Kullback-Leibler divergence:



Theorem 2 It holds \g { n\x) - g (k \x)\ = P (m~^), J \g { n\x) - g {k) (x)\dx = P (m"^) and 
\K(gM,f)-K(g( k \f)\ = P (m-^). 

Then, the following theorem shows that the laws of our estimators of $a_k$, namely $\check c_n(a_k)$ and $\check\gamma_n$, converge towards a linear combination of Gaussian variables.

Theorem 3 It holds ^iA.(c n (a k ) - a k ) £ -T B.Af d (0, P|| |M(a fc , a k ) || 2 ) + CJV d (0, P|| £M(a k} a k ) f) 

and^iA.(%-a k ) £ -T C.Af d (0, P|||M(a fe) a fe )|| 2 ) + C.Af d (0, P\\£M(a k , a k )\\ 2 ) 

where A = Pj^M(a fc , a fc )(P^M(a fc , a k ) + P^M(a fc , a k )), 

C = P^M(a fc , a k ) and B = P^M(a fc , a k ) + P^g-M(a fc , a fc ) + P^M(a fc , a k ). 

3.3. A stopping rule for the procedure

In this paragraph, we show that $\check g_n^{(k)}$ converges towards $f$ in $k$ and $n$. Then, we provide a stopping rule for this identification procedure.

3.3.1. Estimation of f

Through remark 5 and as explained in section 14 of [HUB85], the following lemma shows that $K(\check g_n^{(k-1)}\frac{f_{a_k,n}}{[\check g^{(k-1)}]_{a_k,n}}, f_n)$ converges almost everywhere towards zero as $k$ goes to infinity and thereafter as $n$ goes to infinity:

Lemma 1 We have $\lim_n \lim_k K(\check g_n^{(k-1)}\frac{f_{a_k,n}}{[\check g^{(k-1)}]_{a_k,n}}, f_n) = 0$ a.s.

Consequently, the following theorem provides us with an estimate of $f$:

Theorem 4 We have $\lim_n \lim_k \check g_n^{(k)} = f$ a.s.

3.3.2. Testing of the criteria 

In this paragraph, through a test of the criterion, namely $a \mapsto K(\check g_n^{(k-1)}\frac{f_{a,n}}{[\check g^{(k-1)}]_{a,n}}, f_n)$, we build a stopping rule for this identification procedure. First, the next theorem enables us to derive the law of the criterion:

Theorem 5 For a fixed $k$, we have
$\sqrt{n}\,(Var_P(M(\check c_n(\check\gamma_n), \check\gamma_n)))^{-1/2}\,(P_n M(\check c_n(\check\gamma_n), \check\gamma_n) - P_n M(a_k, a_k)) \xrightarrow{Law} \mathcal{N}(0, I)$,
as $n$ goes to infinity, where $k$ represents the $k$-th step of the algorithm and $I$ is the identity matrix in $\mathbb{R}^d$.

Note that $k$ is fixed in theorem 5 since $\check\gamma_n = \arg\inf_{a \in \Theta}\sup_{c \in \Theta} P_n M(c, a)$ where $M$ is a known function of $k$, see section 3.1.1. Thus, in the case where $K(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$, we obtain

Corollary 1 We have $\sqrt{n}\,(Var_P(M(\check c_n(\check\gamma_n), \check\gamma_n)))^{-1/2}\,P_n M(\check c_n(\check\gamma_n), \check\gamma_n) \xrightarrow{Law} \mathcal{N}(0, I)$.

Hence, we propose the test of the null hypothesis

$(H_0)$: $K(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$ versus $(H_1)$: $K(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) \ne 0$.

Based on this result, we stop the algorithm. Then, defining $a_k$ as the last vector generated, we derive from corollary 1 an $\alpha$-level confidence ellipsoid around $a_k$, namely

$\mathcal{E}_k = \{b \in \mathbb{R}^d;\ \sqrt{n}\,(Var_P(M(b, b)))^{-1/2}\,P_n M(b, b) \le q_\alpha^{\mathcal{N}(0,1)}\}$,

where $q_\alpha^{\mathcal{N}(0,1)}$ is the quantile of an $\alpha$-level reduced centered normal distribution and where $P_n$ is the empirical measure arising from a realization of the sequences $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$.
The following corollary thus provides us with a confidence region for the above test:

Corollary 2 $\mathcal{E}_k$ is a confidence region for the test of the null hypothesis $(H_0)$ versus $(H_1)$.
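
In practice, the decision rule behind $\mathcal{E}_k$ reduces to comparing a studentised empirical mean with a normal quantile. The sketch below only illustrates that comparison in the scalar case. It assumes that the values $M(b, b, X_i)$ have already been computed and are passed in as `criterion_values`, that $Var_P$ is replaced by the sample variance, and that the quantile is taken at the probability `level`; these three points, like the function name, are assumptions of this example and are not specified in this form by the article.

```python
import math
from statistics import NormalDist


def stop_algorithm(criterion_values, level=0.95):
    """Return (accept_h0, statistic) for the scalar stopping test.

    statistic = sqrt(n) * mean / standard deviation of the supplied values,
    compared with the standard normal quantile at probability `level`.
    """
    n = len(criterion_values)
    mean = sum(criterion_values) / n
    variance = sum((v - mean) ** 2 for v in criterion_values) / (n - 1)
    if variance == 0.0:
        raise ValueError("the criterion values must not all be equal")
    statistic = math.sqrt(n) * mean / math.sqrt(variance)
    return statistic <= NormalDist().inv_cdf(level), statistic
```

The algorithm would then be stopped at the first step $k$ for which the first returned value is true.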

4. Comparison of all the optimisation methods 

In this section, we study Huber's algorithm in a similar manner to sections 2 and 3. We will then be able to compare our methodologies.

Until now, the choice has only been to use the class of Gaussian distributions. Here and similarly to section 2.1, we extend this choice to the class of elliptical distributions. Moreover, using the subsample $X_1, X_2, \ldots, X_n$, see Appendix B, and using the procedure of section 2.2 with $K(g_a, f_a)$, see section 4.2, instead of $K(g\frac{f_a}{g_a}, f)$, proposition 10, lemma 12 and remark 5 enable us to perform Huber's algorithm:

• we define $\hat a_1$ and the density $\hat g_n^{(1)}$ such that $\hat a_1 = \arg\max_{a \in \mathbb{R}^d_*} K(g_a, f_{a,n})$ and $\hat g_n^{(1)} = g\frac{f_{\hat a_1,n}}{g_{\hat a_1}}$,

• we define $\hat a_2$ and the density $\hat g_n^{(2)}$ such that $\hat a_2 = \arg\max_{a \in \mathbb{R}^d_*} K(\hat g^{(1)}_{a,n}, f_{a,n})$ and $\hat g_n^{(2)} = \hat g_n^{(1)}\frac{f_{\hat a_2,n}}{\hat g^{(1)}_{\hat a_2,n}}$,

and so on, we obtain a sequence $(\hat a_1, \hat a_2, \ldots)$ of vectors in $\mathbb{R}^d_*$ and a sequence of densities $\hat g_n^{(k)}$.
4.1. Hypotheses on f 

In this paragraph, we define the set of hypotheses on $f$ which can be of use in our present work. First, we denote $g$ in lieu of $g^{(k-1)}$. Let $\Theta^* = \{b \in \Theta \mid \int (\frac{g_b(b^\top x)}{f_b(b^\top x)} - 1) f_a(a^\top x)\,dx < \infty\}$,
$m(b, a, x) = \int \ln\big(\frac{g_b(b^\top x)}{f_b(b^\top x)}\big)\,g_a(a^\top x)\,dx - \big(\frac{g_b(b^\top x)}{f_b(b^\top x)} - 1\big)$,
$P^a m(b, a) = \int m(b, a, x) f_a(a^\top x)\,dx$ and $P_n m(b, a) = \int m(b, a, x)\frac{f_a(a^\top x)}{f(x)}\,dP_n$,
$P^a$ being the probability measure of $f_a$.
Similarly as in chapter V of [VDW], we define:
(H1): For all $\varepsilon > 0$, there is $\eta > 0$ such that, for all $b \in \Theta^*$ verifying $\|b - a_k\| \ge \varepsilon$ for all $a \in \Theta$, we have $P^a m(b, a) < P^a m(a_k, a) - \eta$.
(H2): There exists a neighborhood of $a_k$, $V$, and a positive function $H$, such that, for all $b \in V$, we have $|m(b, a_k, x)| \le H(x)$ ($P^a$-a.s.) with $P^a H < \infty$.
(H3): There exists a neighborhood $V$ of $a_k$, such that for all $\varepsilon$, there exists an $\eta$ such that for all $b \in V$ and $a \in \Theta$ verifying $\|a - a_k\| \ge \varepsilon$, we have $P^{a_k} m(b, a_k) - \eta > P^a m(b, a)$.
Moreover, defining $x \mapsto \upsilon(b, a, x) = \ln\big(\frac{g_b(b^\top x)}{f_b(b^\top x)}\big)\,g_a(a^\top x)$, we consider:
(H4): There exists a neighborhood of $(a_k, a_k)$, $V_k$, such that, for all $(b, a)$ of $V_k$, the gradient $\nabla\big(\frac{g_a(a^\top x)}{f_a(a^\top x)}\big)$ and the Hessian $\mathcal{H}\big(\frac{g_a(a^\top x)}{f_a(a^\top x)}\big)$ exist ($\lambda$-a.s.), and the first order partial derivative of $\frac{g_a(a^\top x)}{f_a(a^\top x)}$ and the first and second order derivatives of $(b, a) \mapsto \upsilon(b, a, x)$ are dominated ($\lambda$-a.s.) by integrable functions.
(H5): The function $(b, a) \mapsto m(b, a)$ is $C^3$ in a neighborhood $V_k$ of $(a_k, a_k)$ for all $x$, and all the partial derivatives of $(b, a) \mapsto m(b, a)$ are dominated in $V_k$ by a $P$-integrable function $H(x)$.
(H6): $P\|\frac{\partial}{\partial b} m(a_k, a_k)\|^2$ and $P\|\frac{\partial}{\partial a} m(a_k, a_k)\|^2$ are finite and the quantities $P\frac{\partial^2}{\partial b_i \partial b_j} m(a_k, a_k)$ and $P\frac{\partial^2}{\partial a_i \partial a_j} m(a_k, a_k)$ are invertible.
(H7): There exists $k$ such that $Pm(a_k, a_k) = 0$.
(H8): $(Var_P(m(a_k, a_k)))^{1/2}$ exists and is invertible.

4.2. The first co-vector of f simultaneously optimizes four problems 

We first study Huber's analytic approach. Let $\mathcal{R}'$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $f(x) r^{-1}(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(fr^{-1}, g)$ in $r$:

Proposition 5 (Analytic Approach) There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r \in \mathcal{R}'} K(fr^{-1}, g) = \frac{f_a}{g_a}$, $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$, as well as $K(f, g) = K(f_a, g_a) + K(f\frac{g_a}{f_a}, g)$.

We also study Huber's synthetic approach. Let $\mathcal{R}$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $g(x) r(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(f, gr)$ in $r$:

Proposition 6 (Synthetic Approach) There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r \in \mathcal{R}} K(f, gr) = \frac{f_a}{g_a}$, $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$, as well as $K(f, g) = K(f_a, g_a) + K(f, g\frac{f_a}{g_a})$.

Meanwhile, the following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(g, fr^{-1})$ in $r$.

Proposition 7 There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r \in \mathcal{R}'} K(g, fr^{-1}) = \frac{f_a}{g_a}$ and $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$. Moreover, we have $K(g, f) = K(g_a, f_a) + K(g, f\frac{g_a}{f_a})$.

Remark 5 First, through the corresponding property, we get $K(f, g\frac{f_a}{g_a}) = K(g, f\frac{g_a}{f_a}) = K(f\frac{g_a}{f_a}, g)$ and $K(f_a, g_a) = K(g_a, f_a)$. Thus, the propositions above imply that finding the argument of the maximum of $K(g_a, f_a)$ amounts to finding the argument of the maximum of $K(f_a, g_a)$. Consequently, the criterion of Huber's methodologies is $a \mapsto K(g_a, f_a)$. Second, our criterion is $a \mapsto K(g\frac{f_a}{g_a}, f)$ and the same property implies $K(g, f\frac{g_a}{f_a}) = K(g\frac{f_a}{g_a}, f)$. Consequently, since [BROKEZ] takes into account the very form of the criterion, we are then in a position to compare Huber's methodologies with ours.

To recapitulate, the choice of $r = \frac{f_a}{g_a}$ enables us to simultaneously solve the following four optimisation problems, for $a \in \mathbb{R}^d_*$:

First, find $a$ such that $a = \arg\inf_{a \in \mathbb{R}^d_*} K(f\frac{g_a}{f_a}, g)$ - analytic approach -

Second, find $a$ such that $a = \arg\inf_{a \in \mathbb{R}^d_*} K(f, g\frac{f_a}{g_a})$ - synthetic approach -

Third, find $a$ such that $a = \arg\sup_{a \in \mathbb{R}^d_*} K(g_a, f_a)$ - to compare Huber's methods with ours -

Fourth, find $a$ such that $a = \arg\inf_{a \in \mathbb{R}^d_*} K(g\frac{f_a}{g_a}, f)$ - our method.
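
The agreement between the third and the fourth problems can be observed on a two-dimensional toy density for which every projection is available in closed form. The sketch below is an illustration under assumptions that are not taken from the article: $f(x_1, x_2) = \xi(x_1)\eta(x_2)$ with $\xi$ a two-component normal mixture and $\eta$ standard normal, directions restricted to $a = (\cos\theta, \sin\theta)$ over a finite set of angles, and divergences approximated by Riemann sums. All function names are hypothetical.

```python
import numpy as np


def normal_pdf(t, mean, var):
    """Density of the univariate normal N(mean, var)."""
    return np.exp(-((t - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def kl(p, q, cell):
    """Riemann-sum approximation of K(p, q) = int p ln(p / q)."""
    mask = p > 0.0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * cell)


def criteria_over_angles(angles, num_points=241, half_width=9.0):
    """Return (huber, ours): K(g_a, f_a) and K(g f_a / g_a, f) for each angle."""
    grid = np.linspace(-half_width, half_width, num_points)
    step = grid[1] - grid[0]
    x1, x2 = np.meshgrid(grid, grid, indexing="ij")

    shift, comp_var = 1.5, 0.25            # assumed mixture parameters of xi
    var1 = comp_var + shift ** 2           # variance of xi
    xi = 0.5 * (normal_pdf(x1, -shift, comp_var) + normal_pdf(x1, shift, comp_var))
    f = xi * normal_pdf(x2, 0.0, 1.0)
    g = normal_pdf(x1, 0.0, var1) * normal_pdf(x2, 0.0, 1.0)

    huber, ours = [], []
    for theta in angles:
        c, s = np.cos(theta), np.sin(theta)
        v = comp_var * c ** 2 + s ** 2     # variance of each projected component

        def f_a(t):
            # Projection of f on a: mixture of N(-shift*c, v) and N(shift*c, v).
            return 0.5 * (normal_pdf(t, -shift * c, v) + normal_pdf(t, shift * c, v))

        def g_a(t):
            # Projection of g on a: N(0, var1*c^2 + s^2).
            return normal_pdf(t, 0.0, var1 * c ** 2 + s ** 2)

        huber.append(kl(g_a(grid), f_a(grid), step))
        t = c * x1 + s * x2
        ours.append(kl(g * f_a(t) / g_a(t), f, step ** 2))
    return np.array(huber), np.array(ours)
```

Called with `angles = np.linspace(0.0, np.pi, 18, endpoint=False)`, the first array is maximal and the second one is minimal at the same angle $\theta = 0$, i.e. at $a = e_1$, where the second criterion vanishes up to rounding errors.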

4.2. On the sequence of the transformed densities $(g^{(j)})$

As already explained in the introduction section, the Mu Zhu article leads us to only consider Huber's synthetic approach.
4.2.1. Estimation of the first co-vector of f

Using the subsample $X_1, X_2, \ldots, X_n$, see Appendix B, and following [BROKEZ], let us introduce the estimate of $K(g_a, f_{a,n})$, through $\hat K(g_a, f_{a,n}) = \int m(a, a, x)\,\frac{f_{a,n}(a^\top x)}{f_n(x)}\,dP_n$.

Proposition 8 Let $\hat a := \arg\sup_{a \in \mathbb{R}^d_*} \hat K(g_a, f_{a,n})$.
Then, $\hat a$ is a strongly convergent estimate of $a$, as defined in the propositions of the previous section.

Finally, we define the following sequences $(\hat a_k)_{k \ge 1}$ and $(\hat g_n^{(k)})_{k \ge 1}$ - for any given $n$:

• $\hat a_k$ is an estimate of $a_k$ as defined in proposition 8 with $\hat g_n^{(k-1)}$ instead of $g$,

• $\hat g_n^{(k)}$ is such that $\hat g_n^{(0)} = g$ and $\hat g_n^{(k)}(x) = \hat g_n^{(k-1)}(x)\frac{f_{\hat a_k,n}(\hat a_k^\top x)}{[\hat g^{(k-1)}]_{\hat a_k,n}(\hat a_k^\top x)}$, i.e. $\hat g_n^{(k)}(x) = g(x)\prod_{j=1}^{k} \frac{f_{\hat a_j,n}(\hat a_j^\top x)}{[\hat g^{(j-1)}]_{\hat a_j,n}(\hat a_j^\top x)}$.

4.2.2. Convergence study at the k th step of the algorithm 

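Let $\hat b_n(a) = \arg\sup_{b \in \Theta} P^a_n m(b, a)$, with $a \in \Theta$, and $\hat\beta_n = \arg\sup_{a \in \Theta}\sup_{b \in \Theta} P^a_n m(b, a)$, then

Proposition 9 $\sup_{a \in \Theta} \|\hat b_n(a) - a_k\|$ converges toward 0 and $\hat\beta_n$ converges toward $a_k$, a.s.

Finally, the following theorem shows that $\hat g_n^{(k)}$ converges almost everywhere towards $g^{(k)}$:

Theorem 6 For any given $k$, it holds $\hat g_n^{(k)} \to_n g^{(k)}$ a.s.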

4.2.3. Asymptotic inference at the k th step of the algorithm 

The following theorem gives the rate at which $\hat g_n^{(k)}$ converges towards $g^{(k)}$ in three different cases, namely for any given $x$, with the $L^1$ distance and with the Kullback-Leibler divergence:



Theorem 7 It holds \gn\x) — g^ k \x)\ = Op(m *+ d ), J ' \jjn\x) — g^ k \x)\dx = P (m 4 + d ) and 
\K(f,$ ) )-K(f,gW)\ = P (m-^). 

The following theorem shows that the laws of Huber's estimators of $a_k$, namely $\hat b_n(a_k)$ and $\hat\beta_n$, converge towards a linear combination of Gaussian variables.

Theorem 8 It holds ^iV.{b n {a k ) - a k ) £ -T £.M d {0, P|||m(a fe , a fc )|| 2 ) + FM d {0, P\\^m{a k , a k )\\ 2 ) 
and y/EV.0 n - a k ) £ -T 0JV d (O, P\\£m(a k , a k )\\ 2 ) + F.N d (0, P\\f b m(a k , a k )\\ 2 ) 
where £ = P^m(a fc , a k ), 7 = P^m(a fc , a k ), Q = P J^m(a fe , a k ) and 
V = (P^m(a fc , a k )P£zm(a k , a k ) - P^m(a fe , a k )P^m(a k , a k )) > 0. 

4.3. A stopping rule for the procedure 

We first give an estimate of /. Then, we provide a stopping rule for this identification procedure. 



Remark 6 In the case where $f$ is known, as explained in section 14 of [HUB85], the sequence $(K(g^{(k-1)}_{a_k}, f_{a_k}))_{k \ge 1}$ converges towards zero. Many authors have studied this hypothesis and its consequences. For example, Huber deduces that, if $f$ can be deconvoluted with a Gaussian component, $(K(g^{(k-1)}_{a_k}, f_{a_k}))_{k \ge 1}$ converges toward 0. He then shows that $g^{(k)}$ uniformly converges in $L^1$ towards $f$ - see propositions 14-2 and 14-3 page 461 of his article.

4.3.1. Estimation of f 

The following lemma shows that $K(\hat g^{(k-1)}_{a_k,n}, f_{a_k,n})$ converges towards zero as $k$ goes to infinity and thereafter as $n$ goes to infinity:

Lemma 2 We have $\lim_n \lim_k K(\hat g^{(k-1)}_{a_k,n}, f_{a_k,n}) = 0$, a.s.

Then, the following theorem enables us to provide simulations through an estimation of $f$:
Theorem 9 We have $\lim_n \lim_k \hat g_n^{(k)} = f$, a.s.



4.3.2. Testing of the criteria 

In this paragraph, through a test of Huber's criteria, namely $a \mapsto K(\check g^{(k-1)}_{a,n}, f_{a,n})$, we will build a
stopping rule for the procedure. First, the next theorem gives us the law of Huber's criteria.

Theorem 10 For a fixed k, we have

$\sqrt{n}\,(Var_P(m(\check b_n(\check\beta_n), \check\beta_n)))^{-1/2}\,(P_n m(\check b_n(\check\beta_n), \check\beta_n) - P_n m(a_k, a_k)) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, I)$,
as n goes to infinity, where k represents the k-th step of the algorithm and I is the identity matrix in
$\mathbb{R}^d$.

Note that k is fixed in theorem 10 since $\check\beta_n = \arg\sup_{a \in \Theta} \sup_{b \in \Theta} P_n m(b, a)$ where m is a known
function of k - see section 4.1. Thus, in the case where $K(g^{(k-1)}_{a_k}, f_{a_k}) = 0$, we obtain

Corollary 3

We have $\sqrt{n}\,(Var_P(m(\check b_n(\check\beta_n), \check\beta_n)))^{-1/2}\,P_n m(\check b_n(\check\beta_n), \check\beta_n) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, I)$.

Hence, we propose the test of the null hypothesis $(H_0) : K(g^{(k-1)}_{a_k}, f_{a_k}) = 0$ versus the alternative
$(H_1) : K(g^{(k-1)}_{a_k}, f_{a_k}) \ne 0$. Based on this result, we stop the algorithm; then, defining $a_k$ as the last
vector generated by Huber's algorithm, we derive from corollary 3 an α-level confidence ellipsoid
around $a_k$, namely $\mathcal{E}'_k = \{b \in \mathbb{R}^d;\ \sqrt{n}\,(Var_P(m(b,b)))^{-1/2} P_n m(b,b) \le q_\alpha^{\mathcal{N}(0,1)}\}$, where $q_\alpha^{\mathcal{N}(0,1)}$ is the
quantile of an α-level reduced centered normal distribution and where $P_n$ is the empirical measure
arising from a realization of the sequences $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$.
Consequently, the following corollary provides us with a confidence region for the above test:

Corollary 4 $\mathcal{E}'_k$ is a confidence region for the test of the null hypothesis $(H_0)$ versus $(H_1)$.

5. Simulations 

We illustrate this section by detailing three simulations. 
In each simulation, the program follows our algorithm and aims at creating a sequence of densities
$(g^{(j)})$, j = 1,..,k, $k \le d$, such that $g^{(0)} = g$, $g^{(j)} = g^{(j-1)} f_{a_j}/[g^{(j-1)}]_{a_j}$ and $K(g^{(k)}, f) = 0$, where K
is the Kullback-Leibler divergence and $a_j = \arg\inf_b K(g^{(j-1)} f_b/[g^{(j-1)}]_b, f)$, for all j = 1,..,k.
Then, in the first two simulations, the program follows Huber's method and generates a sequence of
densities $(g^{(j)})$, j = 1,..,k, $k \le d$, such that $g^{(0)} = g$, $g^{(j)} = g^{(j-1)} f_{a_j}/[g^{(j-1)}]_{a_j}$ and $K(f, g^{(k)}) = 0$,
where K is the Kullback-Leibler divergence and $a_j = \arg\sup_b K([g^{(j-1)}]_b, f_b)$, for all j = 1,..,k.
Finally, in the third example, we study the robustness of our method with four outliers.
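
As a rough illustration of one step of the recursion $g^{(j)} = g^{(j-1)} f_{a_j}/[g^{(j-1)}]_{a_j}$, the following Python sketch replaces the marginal of a standard Gaussian g along a direction a by a Gumbel density and checks numerically that the result still integrates to about 1. The dimension (2), the unit direction a, the Gumbel(0, 1) target and the integration grid are assumptions chosen for the illustration only; this is not the program used for the simulations.

```python
import numpy as np

def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def gumbel_pdf(z):
    """Gumbel density with location 0 and scale 1 (assumed target f_a)."""
    return np.exp(-(z + np.exp(-z)))

def update_step(x, a):
    """Evaluate g1(x) = g(x) * f_a(a.x) / g_a(a.x) for a standard Gaussian g.

    `a` is assumed to be a unit vector, so that the projection a.x of a
    standard Gaussian vector is itself standard normal (g_a = normal_pdf).
    """
    proj = x @ a
    g = np.prod(normal_pdf(x), axis=-1)
    return g * gumbel_pdf(proj) / normal_pdf(proj)

# Riemann sum on a regular grid to check that g1 is (numerically) a density.
step = 0.05
axis = np.arange(-12.0, 12.0 + step, step)
grid = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1)
a = np.array([1.0, 1.0]) / np.sqrt(2.0)
total_mass = update_step(grid, a).sum() * step ** 2
print(total_mass)
```

In this sketch the conditional density of g given the projection is left untouched and only the one-dimensional marginal along a is exchanged, which is the mechanism the recursion relies on.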

Simulation 1 

We are in dimension 3 (=d). We consider a sample of 50 (=n) values of a random variable X with
density f defined by

$f(x) = Normal(x_1 + x_2).Gumbel(x_0 + x_2).Gumbel(x_0 + x_1)$,
where the Gumbel law parameters are (-3, 4) and (1, 1) and where the normal distribution parameters
are (-5, 2). We generate a Gaussian random variable Y with a density - that we will name g - which
has the same mean and variance as f.

In the first part of the program, we theoretically obtain k = 2, $a_1 = (1, 0, 1)$ and $a_2 = (1, 1, 0)$ (or
$a_2 = (1, 0, 1)$ and $a_1 = (1, 1, 0)$, which leads us to the same conclusion). To get this result, we perform
the following test

$(H_0) : (a_1, a_2) = ((1, 0, 1), (1, 1, 0))$ versus $(H_1) : (a_1, a_2) \ne ((1, 0, 1), (1, 1, 0))$.
Moreover, if i represents the last iteration of the algorithm, then

$\sqrt{n}\,(Var_P(M(\check c_n(\check\gamma_n), \check\gamma_n)))^{-1/2}\,P_n M(\check c_n(\check\gamma_n), \check\gamma_n) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, 1)$,
and then we estimate $(a_1, a_2)$ with the following 0.9 (=α) level confidence ellipsoid

$\mathcal{E}_i = \{b \in \mathbb{R}^3;\ (Var_P(M(b,b)))^{-1/2} P_n M(b,b) \le q_\alpha^{\mathcal{N}(0,1)}/\sqrt{n} \simeq 0.03582203\}$.
Indeed, if i = 1 represents the last iteration of the algorithm, then $a_1 \in \mathcal{E}_1$, and if i = 2 represents
the last iteration of the algorithm, then $a_2 \in \mathcal{E}_2$; and so on, if i represents the last iteration of the
algorithm, then $a_i \in \mathcal{E}_i$.
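
The numerical bound in the ellipsoid is simply a normal quantile divided by $\sqrt{n}$. The short Python sketch below reproduces the reported value for n = 50. The constant 0.2533 is not stated in the text: it is inferred here from the reported bound, and it is close to the 0.6 quantile of the standard normal distribution. The function names are hypothetical, and which quantile convention the original program used is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def ellipsoid_threshold(quantile, n):
    """Right-hand side q / sqrt(n) of the confidence-ellipsoid inequality."""
    return quantile / sqrt(n)

def normal_quantile(p):
    """Quantile of order p of the reduced centered normal distribution."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(p)

# Bound for n = 50, using the inferred quantile value 0.2533 (assumption).
reported = ellipsoid_threshold(0.2533, 50)
print(reported)

# For comparison, the standard normal quantile of order 0.6.
print(normal_quantile(0.6))
```

The same function with n = 100 gives the bound used when the sample size is 100.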

Now, if we follow Huber's method, we also theoretically obtain k = 2, $a_1 = (1, 0, 1)$ and $a_2 =
(1, 1, 0)$ (or $a_2 = (1, 0, 1)$ and $a_1 = (1, 1, 0)$, which leads us to the same conclusion). To get this result,
we perform the following test:

$(H_0) : (a_1, a_2) = ((1, 0, 1), (1, 1, 0))$ versus $(H_1) : (a_1, a_2) \ne ((1, 0, 1), (1, 1, 0))$.
Similarly as above, the fact that, if i represents the last iteration of the algorithm, then

$\sqrt{n}\,(Var_P(m(\check b_n(\check\beta_n), \check\beta_n)))^{-1/2}\,P_n m(\check b_n(\check\beta_n), \check\beta_n) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, 1)$, enables us to estimate our sequence
of $(a_i)$, reduced to $(a_1, a_2)$, through the following 0.9 (=α) level confidence ellipsoid

$\mathcal{E}'_i = \{b \in \mathbb{R}^3;\ (Var_P(m(b,b)))^{-1/2} P_n m(b,b) \le q_\alpha^{\mathcal{N}(0,1)}/\sqrt{n} \simeq 0.03582203\}$.
Finally, we obtain



Table 1: Simulation 1 : Numerical results of the optimisation.

                                  | Our Algorithm                     | Huber's Algorithm
Projection Study                  | minimum : 0.317505                | maximum : 0.715135
                                  | at point : (1.0,1.0,0)            | at point : (1.0,1.0,0)
                                  | P-Value : 0.99851                 | P-Value : 0.999839
Test                              | $H_0 : a_1 \in \mathcal{E}_1$ : False | $H_0 : a_1 \in \mathcal{E}'_1$ : False
Projection Study 1                | minimum : 0.0266514               | maximum : 0.007277
                                  | at point : (1.0,0,1.0)            | at point : (1,0.0,1.0)
                                  | P-Value : 0.998852                | P-Value : 0.999835
Test                              | $H_0 : a_2 \in \mathcal{E}_2$ : True  | $H_0 : a_2 \in \mathcal{E}'_2$ : True
K(Estimate $g_m^{(2)}$, $g^{(2)}$) | 0.444388                          | 0.794124

Therefore, we conclude that $f = g^{(2)}$.



Simulation 2 

We are in dimension 10 (=d). We consider a sample of 50 (=n) values of a random variable X with
density f defined by

$f(x) = Gumbel(x_0).Normal(x_1, \ldots, x_9)$,
where the Gumbel law parameters are -5 and 1 and where the normal distribution is reduced and
centered.

Our reasoning is the same as in Example 1. In the first part of the program, we theoretically obtain
k = 1 and $a_1 = (1, 0, \ldots, 0)$. To get this result, we perform the following test

$(H_0) : a_1 = (1, 0, \ldots, 0)$ versus $(H_1) : a_1 \ne (1, 0, \ldots, 0)$.
We estimate $a_1$ by the following 0.9 (=α) level confidence ellipsoid

$\mathcal{E}_1 = \{b \in \mathbb{R}^{10};\ (Var_P(M(b,b)))^{-1/2} P_n M(b,b) \le q_\alpha^{\mathcal{N}(0,1)}/\sqrt{n} \simeq 0.03582203\}$.
Now, if we follow Huber's method, we also theoretically obtain k = 1 and $a_1 = (1, 0, \ldots, 0)$. To get
this result, we perform the following test

$(H_0) : a_1 = (1, 0, \ldots, 0)$ versus $(H_1) : a_1 \ne (1, 0, \ldots, 0)$.
Hence, using the same reasoning as in Example 1, we estimate $a_1$ through the following 0.9 (=α)
level confidence ellipsoid

$\mathcal{E}'_1 = \{b \in \mathbb{R}^{10};\ (Var_P(m(b,b)))^{-1/2} P_n m(b,b) \le q_\alpha^{\mathcal{N}(0,1)}/\sqrt{n} \simeq 0.03582203\}$.

And, we obtain






Table 2: Simulation 2 : Numerical results of the optimisation.

                                  | Our Algorithm                            | Huber's Algorithm
Projection Study 0                | minimum : 0.00263554                     | maximum : 0.00376235
                                  | at point : (1.0001, 0.0040338, 0.098606, | at point : (0.9902, 0.0946806, 0.161447,
                                  | 0.115214, 0.067628, 0.16229, 0.00549203, | 0.0090245, 0.147804, 0.180259, 0.0975065,
                                  | 0.014319, 0.149339, 0.0578906)           | 0.101044, 0.190976, 0.155706)
                                  | P-Value : 0.828683                       | P-Value : 0.807121
Test                              | $H_0 : a_1 \in \mathcal{E}_1$ : True         | $H_0 : a_1 \in \mathcal{E}'_1$ : True
K(Estimate $g_m^{(1)}$, $g^{(1)}$) | 2.44546                                  | 2.32331

Therefore, we conclude that $f = g^{(1)}$.



Simulation 3 

We are in dimension 20 (=d). We first generate a sample with 100 (=n) observations, namely four
outliers x = (2, 0, ..., 0) and 96 values of a random variable X with a density f defined by

$f(x) = Gumbel(x_0).Normal(x_1, \ldots, x_{19})$
where the Gumbel law parameters are -5 and 1 and where the normal distribution is reduced and
centered. Our reasoning is the same as in Simulation 1.

We theoretically obtain k = 1 and $a_1 = (1, 0, \ldots, 0)$. To get this result, we perform the following test

$(H_0) : a_1 = (1, 0, \ldots, 0)$ versus $(H_1) : a_1 \ne (1, 0, \ldots, 0)$
We estimate $a_1$ by the following 0.9 (=α) level confidence ellipsoid

$\mathcal{E}_1 = \{b \in \mathbb{R}^{20};\ (Var_P(M(b,b)))^{-1/2} P_n M(b,b) \le q_\alpha^{\mathcal{N}(0,1)}/\sqrt{n} \simeq 0.02533\}$
And, we obtain



Table 3: Simulation 3: Numerical results of the optimisation.

                                  | Our Algorithm
Projection Study                  | minimum : 0.024110
                                  | at point : (0.8221, 0.0901, 0.0892, -0.2020, 0.0039, 0.1001,
                                  | 0.0391, 0.08001, 0.07633, -0.0437, 0.12093, 0.09834, 0.1045,
                                  | 0.0874, -0.02349, 0.03001, 0.12543, 0.09435, 0.0587, -0.0055)
                                  | P-Value : 0.77004
Test                              | $H_0 : a_1 \in \mathcal{E}_1$ : True
K(Estimate $g_m^{(1)}$, $g^{(1)}$) | 2.677015

Therefore, we conclude that $f = g^{(1)}$.



Critique of the simulations

As is customary in simulation studies, approximations accumulate, and results depend on the power
of the computers used as well as on the available memory. Moreover, in order to implement our
optimisation of the relative entropy in $\mathbb{R}^d$, we choose to apply the simulated annealing method.
Thus, in the case where f is unknown, we can never be certain of having reached the desired
minimum or maximum of the Kullback-Leibler divergence. Indeed, this probabilistic metaheuristic
only converges, and the probability of reaching the minimum or the maximum only tends towards 1,
when the number of random jumps tends in theory towards infinity.

We also note that no theory exists on the optimal number of jumps to implement, as this number
depends on the specificities of each particular problem.
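
To make the role of the number of random jumps concrete, here is a minimal, generic simulated annealing sketch in Python. It is not the optimiser used for the simulations: the Gaussian jump proposal, the geometric cooling schedule, the parameter values and the toy quadratic objective (which merely stands in for a divergence criterion) are all assumptions of this illustration.

```python
import math
import random

def simulated_annealing(objective, start, n_jumps=5000, step=0.5,
                        t0=1.0, cooling=0.995, seed=0):
    """Minimise `objective` over R^d with Gaussian random jumps."""
    rng = random.Random(seed)
    current = list(start)
    current_val = objective(current)
    best, best_val = list(current), current_val
    temperature = t0
    for _ in range(n_jumps):
        candidate = [c + rng.gauss(0.0, step) for c in current]
        cand_val = objective(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept deteriorations with
        # probability exp(-delta / T) (Metropolis rule).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = list(current), current_val
        temperature *= cooling
    return best, best_val

def toy_objective(b):
    """Toy criterion with known minimiser (2, -1); hypothetical stand-in."""
    return (b[0] - 2.0) ** 2 + (b[1] + 1.0) ** 2

best, best_val = simulated_annealing(toy_objective, start=[0.0, 0.0])
print(best, best_val)
```

With a finite `n_jumps` the returned point is only the best value visited so far, which is why no certainty of having reached the optimum can be claimed.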

Finally, we choose $50^{-\frac{4}{4+d}}$ (resp. $100^{-\frac{4}{4+d}}$) for the AMISE of the simulations 1 and 2 (resp. 3).
This choice leads us to simulate 50 (resp. 100) random variables, see [SCOTT92] page 151, none of
which have been discarded to obtain the truncated sample.
Conclusion 

Characteristic structures as well as one-dimensional projections and their associated distributions in 
multivariate datasets can be evidenced through Projection Pursuit. 

The present article demonstrates that our Kullback-Leibler divergence minimisation method
constitutes a good alternative to Huber's relative entropy maximisation approach, see [HUB85]. Indeed,
the convergence results as well as the simulations we carried out clearly evidence the robustness of
our methodology.

A. Reminders 

A.1. The relative entropy (or Kullback-Leibler divergence)

We call $h_a$ the density of $a^\top Z$ if h is the density of Z, and K the relative entropy or
Kullback-Leibler divergence. The function K is defined by - considering P and Q, two probabilities:

$K(Q, P) = \int \varphi(\frac{dQ}{dP})\,dP$ if $Q \ll P$ and

$K(Q, P) = +\infty$ otherwise,
where $\varphi : x \mapsto x\ln(x) - x + 1$ is strictly convex.

Let us present some well-known properties of the Kullback-Leibler divergence.
Property 2 We have $K(P, Q) = 0 \Leftrightarrow P = Q$.
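
For discrete probabilities, the definition through the convex function x ln(x) - x + 1 coincides with the usual logarithmic form of the Kullback-Leibler divergence, because both measures have total mass 1. The Python sketch below checks this on a small example; the two probability vectors and the function names are arbitrary choices made for the illustration.

```python
import math

def phi(x):
    """phi(x) = x ln(x) - x + 1, with the convention phi(0) = 1."""
    return x * math.log(x) - x + 1.0 if x > 0 else 1.0

def kl_phi(q, p):
    """K(Q, P) = sum_i p_i * phi(q_i / p_i); +inf if Q is not dominated by P."""
    if any(pi == 0 and qi > 0 for qi, pi in zip(q, p)):
        return math.inf
    return sum(pi * phi(qi / pi) for qi, pi in zip(q, p) if pi > 0)

def kl_log(q, p):
    """Usual form sum_i q_i * ln(q_i / p_i), valid when Q is dominated by P."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

Q = [0.5, 0.3, 0.2]
P = [0.2, 0.5, 0.3]
print(kl_phi(Q, P), kl_log(Q, P))
print(kl_phi(P, P))
```

The last line illustrates Property 2 in one direction: the divergence of a probability from itself is zero.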

Property 3 The divergence function $Q \mapsto K(Q, P)$ is convex, lower semi-continuous (l.s.c.) - for
the topology that makes all the applications of the form $Q \mapsto \int f\,dQ$ continuous where f is bounded
and continuous - as well as l.s.c. for the topology of the uniform convergence.

Property 4 (corollary (1.29), page 19 of [LIVAJ]) If $T : (X, A) \to (Y, B)$ is measurable and
if $K(P, Q) < \infty$, then $K(P, Q) \ge K(PT^{-1}, QT^{-1})$, with equality being reached when T is surjective
for (P,Q).
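
Property 4 can be checked numerically on finite spaces, where a measurable map T simply merges points. In the Python sketch below, the probability vectors, the map T and the helper names are hypothetical choices for the illustration.

```python
import math

def kl(p, q):
    """Discrete K(P, Q) = sum_i p_i ln(p_i / q_i), assuming q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_forward(p, mapping, n_out):
    """Image measure P T^{-1} of p under the map T: i -> mapping[i]."""
    out = [0.0] * n_out
    for mass, target in zip(p, mapping):
        out[target] += mass
    return out

P = [0.1, 0.4, 0.3, 0.2]
Q = [0.25, 0.25, 0.25, 0.25]
T = [0, 0, 0, 1]  # merges the first three points, keeps the last one apart

coarse_P = push_forward(P, T, 2)
coarse_Q = push_forward(Q, T, 2)
print(kl(P, Q), kl(coarse_P, coarse_Q))
```

Projecting a density onto a direction is a map of this kind, which is why the divergence between projected densities never exceeds the divergence between the original ones.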

Theorem 11 (theorem III.4 of [AZE97]) Let $f : I \to \mathbb{R}$ be a convex function. Then f is a
Lipschitz function in all compact intervals $[a, b] \subset int\{I\}$. In particular, f is continuous on $int\{I\}$.



A.2. Useful lemmas

Lemma 3 Let f be a density in $\mathbb{R}^d$ bounded and positive. Then, any projection density of f - that
we will name $f_a$, with $a \in \mathbb{R}^d_*$ - is also bounded and positive in $\mathbb{R}$.

Lemma 4 Let f be a density in $\mathbb{R}^d$ bounded and positive. Then any density $f(./a^\top x)$, for any
$a \in \mathbb{R}^d$, is also bounded and positive.

Lemma 5 If f and g are positive and bounded densities, then $g^{(k)}$ is positive and bounded.

Lemma 6 Let f be an absolutely continuous density, then, for all sequences $(a_n)$ tending to a in $\mathbb{R}^d_*$,
the sequence $f_{a_n}$ uniformly converges towards $f_a$.

Proof:

For all a in $\mathbb{R}^d_*$, let $F_a$ be the cumulative distribution function of $a^\top X$ and $\psi_a$ be a complex function
defined by $\psi_a(u, v) = F_a(\mathcal{R}e(u + iv)) + iF_a(\mathcal{R}e(v + iu))$, for all u and v in $\mathbb{R}$.

First, the function $\psi_a(u, v)$ is an analytic function, because $x \mapsto f_a(a^\top x)$ is continuous, and as a result
of the corollary of Dini's second theorem - according to which "A sequence of cumulative distribution
functions which pointwise converges on $\mathbb{R}$ towards a continuous cumulative distribution function F on
$\mathbb{R}$, uniformly converges towards F on $\mathbb{R}$" - we deduce that, for all sequences $(a_n)$ converging towards
a, $\psi_{a_n}$ uniformly converges towards $\psi_a$. Finally, the Weierstrass theorem (see proposal (10.1) page
220 of [DI80]) implies that all sequences $\psi'_{a_n}$ uniformly converge towards $\psi'_a$, for all $a_n$ tending to
a. We can therefore conclude. □

Lemma 7 The set $\Gamma_c$ is closed in $L^1$ for the topology of the uniform convergence.

Lemma 8 For all c > 0, we have $\Gamma_c \subset B_{L^1}(f, c)$, where $B_{L^1}(f, c) = \{p \in L^1;\ \|f - p\|_1 \le c\}$.

Lemma 9 G is closed in $L^1$ for the topology of the uniform convergence.

Lemma 10 Let H be an integrable function and let $C = \int H\,dP$ and $C_n = \int H\,dP_n$,
then $C_n - C = O_P(\frac{1}{\sqrt{n}})$.

B. Study of the sample 

Let $X_1, X_2, .., X_m$ be a sequence of independent random vectors with the same density f. Let $Y_1,
Y_2, .., Y_m$ be a sequence of independent random vectors with the same density g. Then, the kernel
estimators $f_m$ and $f_{a,m}$ of f and $f_a$, for all $a \in \mathbb{R}^d_*$, almost surely and uniformly converge since we
assume that the bandwidth $h_m$ of these estimators meets the following conditions (see [BOLE]):
(Hyp): $h_m \searrow_m 0$, $m h_m \nearrow_m \infty$, $m h_m / L(h_m^{-1}) \to_m \infty$ and $L(h_m^{-1})/LLm \to_m \infty$, with $L(u) = \ln(u \vee e)$.
Let us consider A (m,a) = ±Z™ln{ f 9ai f Y $, } 9 -^ 1 , A'Jm,a) = ±WJ f 9a{ f x { , - l) fa >T^f\ 

B Q (m,a) = iTPJni /a '7 ( f v ? f% } , B'{m,a) = - { f -^& /%\ })- 

uv ' ' m 1 L g a (a'Yi) f m (Yi)-> g a (a'Yi) U\ " / m l— -LV L g a (a [ Xi) f m (Xi) > > 

Our goal is to estimate the maximum of $K(g_a, f_a)$ and the minimum of $K(g\frac{f_a}{g_a}, f)$.

To achieve this, it is necessary for us to truncate $X_1, X_2, .., X_m$ and $Y_1, Y_2, .., Y_m$:

Let us consider now a sequence $\theta_m$ such that $\theta_m \to 0$ and $y_m/\theta_m^2 \to 0$, where $y_m$ is defined through
lemma 13 with $y_m = O_P(m^{-\frac{2}{4+d}})$. We will generate $f_m$ and $f_{b,m}$ from the starting sample and we
select the $X_i$ and the $Y_i$ vectors such that $f_m(X_i) \ge \theta_m$ and $g(Y_i) \ge \theta_m$, for all i and for all $b \in \mathbb{R}^d_*$ -
for Huber's algorithm - and such that $f_m(X_i) \ge \theta_m$ and $g_b(b^\top Y_i) \ge \theta_m$, for all i and for all $b \in \mathbb{R}^d_*$ - for
our algorithm. The vectors meeting these conditions will be called $X_1, X_2, .., X_n$ and $Y_1, Y_2, .., Y_n$.
Consequently, the next proposition provides us with the condition required to obtain our estimates.
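
The truncation step can be sketched as follows in Python: a Gaussian kernel estimate is evaluated at each observation, and only the observations whose estimated density is at least the threshold are kept. The bandwidth rule, the threshold value passed by the caller and the simulated sample are assumptions of this illustration, not the settings of the original program.

```python
import numpy as np

def gaussian_kde(sample, points, bandwidth):
    """Product Gaussian kernel estimate of the density of `sample` at `points`."""
    m, d = sample.shape
    scaled = (points[:, None, :] - sample[None, :, :]) / bandwidth
    kernels = np.exp(-0.5 * np.sum(scaled ** 2, axis=2))
    norm = m * (2.0 * np.pi) ** (d / 2.0) * np.prod(bandwidth)
    return kernels.sum(axis=1) / norm

def truncate(sample, theta):
    """Keep the observations X_i whose estimated density f_m(X_i) is >= theta."""
    m, d = sample.shape
    # Scott-type bandwidth, one per coordinate (an assumption of this sketch).
    bandwidth = sample.std(axis=0, ddof=1) * m ** (-1.0 / (d + 4))
    density = gaussian_kde(sample, sample, bandwidth)
    keep = density >= theta
    return sample[keep], density, keep

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
X_kept, f_m, mask = truncate(X, theta=0.01)
print(X.shape[0], X_kept.shape[0])
```

The number of rows of `X_kept` plays the role of n, the size of the truncated sample, while the original sample size is m.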



Proposition 10 Using the notations introduced in [BROKEZ] and in sections 3.1.1 and 4.1, it
holds

$\sup_{a \in \mathbb{R}^d_*} |(A(n, a) - A'(n, a)) - K(g_a, f_a)| \to 0$ a.s., (6)

$\sup_{a \in \mathbb{R}^d_*} |(B(n, a) - B'(n, a)) - K(g\frac{f_a}{g_a}, f)| \to 0$ a.s. (7)

Remark 7 We can take for $\theta_m$ the expression $m^{-\nu}$, with $0 < \nu < \ldots$ Moreover, to estimate $a_k$,
$k \ge 2$, we use the same procedure as the one we followed in order to find $a_1$, with $\check g_n^{(k-1)}$ instead of
g - since $g^{(k-1)}$ is unknown in this case.



C. Case study: f is known

In this Appendix, we study the case when f and g are known.

C.1. Convergence study at the k-th step of the algorithm:

In this paragraph, when k is less than or equal to d, we show that the sequence $(\check a_k)_n$ converges
towards $a_k$ and that the sequence $(\check g^{(k)})_n$ converges towards $g^{(k)}$.
Both $\check\gamma_n$ and $\check c_n(a)$ are M-estimators and estimate $a_k$ - see [BROKEZ]. We state

Proposition 11 Assuming (H'1) to (H'3) hold, $\sup_{a \in \Theta} \|\check c_n(a) - a_k\|$ tends to 0 a.s. and $\check\gamma_n$ tends to $a_k$ a.s.

Finally, the following theorem shows us that $\check g^{(k)}$ converges uniformly almost everywhere towards
$g^{(k)}$, for any k = 1..d.

Theorem 12 Assuming (H'1) to (H'3) hold. Then, $\check g^{(k)} \to_n g^{(k)}$ a.s. and uniformly a.e.

C.2. Asymptotic inference at the k-th step of the algorithm

The following theorem shows that $\check g^{(k)}$ converges at the rate $O_P(n^{-1/2})$ in three different cases,
namely for any given x, with the $L^1$ distance and with the Kullback-Leibler divergence:

Theorem 13 Assuming (H'0) to (H'3) hold, for any k = 1, ..., d and any $x \in \mathbb{R}^d$, we have

$|\check g^{(k)}(x) - g^{(k)}(x)| = O_P(n^{-1/2})$, (8)
$\int |\check g^{(k)}(x) - g^{(k)}(x)|\,dx = O_P(n^{-1/2})$, (9)

$|K(\check g^{(k)}, f) - K(g^{(k)}, f)| = O_P(n^{-1/2})$. (10)

The following theorem shows that the laws of our estimators of $a_k$, namely $\check c_n(a_k)$ and $\check\gamma_n$, converge
towards a linear combination of Gaussian variables.

Theorem 14 Assuming that conditions (H'l) to (H'6) hold, then 

y^A.(c n (a k )-a k ) C -r B.M d (0,P\\§;M(a k ,a k )\\ 2 ) + C.M d (0,P\\£M(a k ,a k )\\ 2 ) and 

y/fLA.(% - a k ) £ -T C.A^(0,P|||M(a fc ,a fe )|| 2 ) + C.Af d (0,P\\£M(a k , a k )\\ 2 ) 

where A = (P^M(a fc , a k )(P^ a -M(a k) a k ) + P^-M(a k , a k ))), 

C = P|jl(a fc , a k ) and B = P^- b M(a k , a k ) + P^L_-M(a k , a k ) + P^-M(a k , a k ). 



C.3. A stopping rule for the procedure

We now assume that the algorithm does not stop after d iterations. We then remark that it still
holds - for any i > d:

• $g^{(i)}(x) = g(x) \prod_{k=1}^{i} \frac{f_{a_k}(a_k^\top x)}{[g^{(k-1)}]_{a_k}(a_k^\top x)}$, with $g^{(0)} = g$.

• $K(g^{(0)}, f) \ge K(g^{(1)}, f) \ge K(g^{(2)}, f) \ge \ldots \ge 0$.

• Theorems 12, 13 and 14.

Moreover, through remark 5 page 10 and as explained in section 14 of [HUB85], the sequence
$(K(g^{(k-1)}\frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f))_{k \ge 1}$ converges towards zero. Then, in this paragraph, we show that $g^{(i)}$ converges
towards f in i. Finally, we provide a stopping rule for this identification procedure.

C. 3.1. Representation of f 

Under (H'Q), the following proposition shows us that the probability measure with density g^ 
converges towards the probability measure with density / : 

Proposition 12 We have lim^g^' = f a.s. 

C.3.2. Testing of the criteria

Through a test of the criteria, namely $a \mapsto K(g^{(k-1)}\frac{f_a}{[g^{(k-1)}]_a}, f)$, we build a stopping rule for this
procedure. First, the next theorem enables us to derive the law of the criteria.

Theorem 15 Assuming that (H'1) to (H'3), (H'6) and (H'8) hold. Then,

$\sqrt{n}\,(Var_P(M(\check c_n(\check\gamma_n), \check\gamma_n)))^{-1/2}\,(P_n M(\check c_n(\check\gamma_n), \check\gamma_n) - P_n M(a_k, a_k)) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, I)$,
where k represents the k-th step of the algorithm and with I being the identity matrix in $\mathbb{R}^d$.

Note that k is fixed in theorem 15 since $\check\gamma_n = \arg\inf_{a \in \Theta} \sup_{c \in \Theta} P_n M(c, a)$ where M is a known
function of k - see section 3.1.1. Thus, in the case where $K(g^{(k-1)}\frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f) = 0$, we obtain

Corollary 5 Assuming that (H'1) to (H'3), (H'6), (H'7) and (H'8) hold. Then,
$\sqrt{n}\,(Var_P(M(\check c_n(\check\gamma_n), \check\gamma_n)))^{-1/2}\,P_n M(\check c_n(\check\gamma_n), \check\gamma_n) \xrightarrow{\mathcal{L}aw} \mathcal{N}(0, I)$.

Hence, we propose the test of the null hypothesis $(H_0) : K(g^{(k-1)}\frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f) = 0$ versus $(H_1) :
K(g^{(k-1)}\frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f) \ne 0$. Based on this result, we stop the algorithm; then, defining $a_k$ as the last
vector generated, we derive from corollary 5 an α-level confidence ellipsoid around $a_k$, namely

$\mathcal{E}_k = \{b \in \mathbb{R}^d;\ \sqrt{n}\,(Var_P(M(b,b)))^{-1/2} P_n M(b,b) \le q_\alpha^{\mathcal{N}(0,1)}\}$,
where $q_\alpha^{\mathcal{N}(0,1)}$ is the quantile of an α-level reduced centered normal distribution.
Consequently, the following corollary provides us with a confidence region for the above test:

Corollary 6 $\mathcal{E}_k$ is a confidence region for the test of the null hypothesis $(H_0)$ versus $(H_1)$.

D. Hypotheses' discussion 

D.1. Discussion on (H'2).

We verify this hypothesis in the case where:

• $a_1$ is the unique element of $\mathbb{R}^d_*$ such that $f(./a_1^\top x) = g(./a_1^\top x)$, i.e. $K(g(./a_1^\top x) f_{a_1}(a_1^\top x), f) = 0$, (1)

• f and g are bounded and positive, (2)

• there exists a neighborhood V of $a_k$ such that, for all b in V and for all positive real A, there exists
S > 0 such that $g(./b^\top x) \le S.f(./b^\top x)$ with $\|x\| > A$. (3)

We remark that we obtain the same proof with f, $g^{(k-1)}$ and $a_k$.

First, (1) implies that = f. Hence, > / ln(j^)g^dx = -K(g&,f) > -K(gJ) as a result 
of the very construction of g&. Besides, (2) and (3) imply that there exists a neighborhood V of 
such that, for all c in V, there exists S > such that, for all x in R d , g(./c'x) < S.f(./c'x). 
Consequently, we get \M(c, a x , x)\ < \ - K(g, f)\ + \- (f^gj - 1)| < K(g, f) + S + l. 
Finally, we infer the existence a neighborhood V of such that, for all c in V, 
\M(c,a k ,x)\ < H(x) = K(g,f)+S + 1 (P - a.s.) with PH < oo. 

D.2. Discussion on (H'3).

We verify this hypothesis in the case where $a_1$ is the unique element of $\mathbb{R}^d_*$ such that $f(./a_1^\top x) =
g(./a_1^\top x)$, i.e. $K(g(./a_1^\top x) f_{a_1}(a_1^\top x), f) = 0$ - we obtain the same proof with f, $g^{(k-1)}$ and $a_k$.
Preliminary (A): Shows that A = {{c,x) G M^\{ai} x R d ; H^fj > f^rfj and > 
/(x)} = through a reductio ad absurdum, i.e. if we assume A ^ 0. 

Thus, we have f(x) = f(./ajx)f ai (ajx) = g(./ajx)f ai (ajx) > g(./c T x)f c (c T x) > f, since ^ ^ > 

g£g implies g(./aJx)U(^x) = g(x)£$$ > 9^)^ = g(./c T x)f c (c T x) J i.e. / > /. We can 
therefore conclude. 

Preliminary (5): Sows toat 5 = {(c,x) G Rf\{ai} x R d ; < and g(x)£f^ < 

f(x)} = through a reductio ad absurdum, i.e. if we assume B ^ 0. 

Thus, we have f(x) = f(./ajx)f ai (ajx) = g(./ajx)f ai (ajx) < g(./c T x)f c (c T x) < f. 

We can thus conclude as above. 

Let us now prove (H'3). We have PM(c, ai ) - PM(c, a) = J ^( gfj^g - ^0 ) }g(x)dx. 

Moreover, the logarithm In is negative on {x G M*; ^^f c J^.^| < 1} and is positive on {x G 
Sw^j ^ Thus ' the Preliminary studies (A) and (5) show that ln{ ffJ\ c S?l ) and 

r /aiK^/ct^ih always present a negative product. We can thus conclude, since (c, a) !->■ PM(c, ai) — 
PM(c, a) is not null for all c and for all a 7^ a%. □ 

E. Proofs 

Remark 8 1/ (H'0) - according to which f and g are assumed to be positive and bounded - through
lemma 5 implies that $g^{(k)}$ and $\check g^{(k)}$ are positive and bounded.

2/ Remark^ implies that f n , g n , gn and cjn^ are positive and bounded since we consider a Gaussian 
kernel. 

Proof of propositions 5 and 6. Let us first study proposition 6.
Without loss of generality, we prove this proposition with $x_1$ in lieu of $a^\top X$.

We define g* = gr. We remark that g and g* present the same density conditionally to $x_1$. Indeed,
$g^*_1(x_1) = \int g^*(x)\,dx_2 \ldots dx_d = \int r(x_1) g(x)\,dx_2 \ldots dx_d = r(x_1) \int g(x)\,dx_2 \ldots dx_d = r(x_1) g_1(x_1)$.
Thus, we can prove this proposition. We have $g(.|x_1) = \frac{g(x_1, \ldots, x_d)}{g_1(x_1)}$ and $g_1(x_1) r(x_1)$ is the marginal
density of g*. Hence, g* is a density since g* is positive and since

$\int g^*\,dx = \int g_1(x_1) r(x_1) g(.|x_1)\,dx = \int g_1(x_1) \frac{f_1(x_1)}{g_1(x_1)} (\int g(.|x_1)\,dx_2 \ldots dx_d)\,dx_1 = \int f_1(x_1)\,dx_1 = 1$. Moreover,

$K(f, g^*) = \int f\{\ln(f) - \ln(g^*)\}\,dx$, (11)

$= \int f\{\ln(f(.|x_1)) - \ln(g^*(.|x_1)) + \ln(f_1(x_1)) - \ln(g_1(x_1) r(x_1))\}\,dx$,

$= \int f\{\ln(f(.|x_1)) - \ln(g(.|x_1)) + \ln(f_1(x_1)) - \ln(g_1(x_1) r(x_1))\}\,dx$, (12)

as $g^*(.|x_1) = g(.|x_1)$. Since the minimum of this last equation (12) is reached through the
minimization of $\int f\{\ln(f_1(x_1)) - \ln(g_1(x_1) r(x_1))\}\,dx = K(f_1, g_1 r)$, then property 2 necessarily implies that
$f_1 = g_1 r$, hence $r = f_1/g_1$. Finally, we have $K(f, g) - K(f, g^*) = \int f\{\ln(f_1(x_1)) - \ln(g_1(x_1))\}\,dx =
K(f_1, g_1)$, which completes the demonstration of proposition 6.

Similarly, if we replace $f^* = f r^{-1}$ with f and g with g*, we obtain the proof of proposition 5. □
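
The identity $K(f, g) - K(f, g^*) = K(f_1, g_1)$ with $g^* = g f_1/g_1$ can be checked numerically on discrete distributions, where integrals become sums. The Python sketch below does so for two randomly generated joint probability tables; the table size and the random seed are arbitrary assumptions of the illustration.

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence sum p ln(p/q), all entries > 0."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
f = rng.random((3, 4)) + 0.1
f /= f.sum()
g = rng.random((3, 4)) + 0.1
g /= g.sum()

# Marginals along the first coordinate and the correcting ratio r = f_1 / g_1.
f1 = f.sum(axis=1)
g1 = g.sum(axis=1)
r = f1 / g1
g_star = g * r[:, None]

lhs = kl(f, g) - kl(f, g_star)
rhs = kl(f1, g1)
print(g_star.sum(), lhs, rhs)
```

Replacing the first marginal of g by that of f therefore lowers the divergence to f by exactly the divergence between the two marginals.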
Proof of propositions 2 and 7. The proof of proposition 2 (resp. 7) is very similar to the
one for proposition 6, save for the fact we now base our reasoning at row (11) on $K(g^*, f) =
\int g^*\{\ln(g^*) - \ln(f)\}\,dx$ (resp. $\int g\{\ln(g^*) - \ln(f)\}\,dx$) instead of $K(f, g^*) = \int f\{\ln(f) - \ln(g^*)\}\,dx$. □
Proof of lemma 11

Lemma 11 If the family $(a_j)_{j=1 \ldots d}$ is a basis of $\mathbb{R}^d$ then
$g(./a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_1^\top x, \ldots, a_j^\top x)$.

Putting $A = (a_1, .., a_d)$, let us determine f in the A basis. Let us first study the function defined
by $\psi : \mathbb{R}^d \to \mathbb{R}^d$, $x \mapsto (a_1^\top x, .., a_d^\top x)$. We can immediately say that $\psi$ is continuous and since
A is a basis, its bijectivity is obvious. Moreover, let us study its Jacobian. By definition, it is
$J_\psi(x_1, \ldots, x_d) = |(\frac{\partial \psi_i}{\partial x_j})_{1 \le i,j \le d}| = |(a_{i,j})_{1 \le i,j \le d}| = |A| \ne 0$ since A is a basis. We can therefore infer:
for any x in $\mathbb{R}^d$, there exists a unique y in $\mathbb{R}^d$ such that $f(x) = |A|^{-1} \Psi(y)$, i.e. $\Psi$ (resp. y) is the
expression of f (resp. of x) in basis A, namely $\Psi(y) = \tilde n(y_{j+1}, \ldots, y_d) \tilde h(y_1, \ldots, y_j)$, with $\tilde n$ and $\tilde h$ being
the expressions of n and h in the A basis. Consequently, our results in the case where the family
$\{a_j\}_{1 \le j \le d}$ is the canonical basis of $\mathbb{R}^d$ still hold for $\Psi$ in the A basis - see section 2.1.2. And then,
if $\tilde g$ is the expression of g in the A basis, we have $\tilde g(./y_1, \ldots, y_j) = \tilde n(y_{j+1}, \ldots, y_d) = \Psi(./y_1, \ldots, y_j)$, i.e.
$g(./a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_1^\top x, \ldots, a_j^\top x)$. □
Proof of lemma 12

Lemma 12 $\inf_{a \in \mathbb{R}^d_*} K(g\frac{f_a}{g_a}, f)$ is reached.

Indeed, let G be $\{g\frac{f_a}{g_a};\ a \in \mathbb{R}^d_*\}$ and $\Gamma_c$ be $\Gamma_c = \{p;\ K(p, f) \le c\}$ for all c > 0. From lemmas 7, 8 and
9 (see page 16), we get that $\Gamma_c \cap G$ is compact for the topology of the uniform convergence, if $\Gamma_c \cap G$ is not
empty. Hence, and since property 3 (see page 15) implies that $Q \mapsto K(Q, P)$ is lower semi-continuous
in $L^1$ for the topology of the uniform convergence, the infimum is reached in $L^1$. (Taking for
example c = K(g, f), $\Gamma_c \cap G$ is necessarily not empty because we always have $K(g\frac{f_a}{g_a}, f) \le K(g, f)$.) □
Proof of lemma 13

Lemma 13 For any continuous density f, we have $y_m = |f_m(x) - f(x)| = O_P(m^{-\frac{2}{4+d}})$.

Defining $b_m(x)$ as $b_m(x) = |E(f_m(x)) - f(x)|$, we have $y_m \le |f_m(x) - E(f_m(x))| + b_m(x)$. Moreover,
from page 150 of [SCOTT92], we derive that $b_m(x) = O_P(\sum_{j=1}^{d} h_j^2)$ where $h_j = O_P(m^{-\frac{1}{4+d}})$. Then, we
infer $b_m(x) = O_P(m^{-\frac{2}{4+d}})$. Finally, since the central limit theorem rate is $O_P(m^{-\frac{1}{2}})$, we then obtain
that $y_m \le O_P(m^{-\frac{1}{2}}) + O_P(m^{-\frac{2}{4+d}}) = O_P(m^{-\frac{2}{4+d}})$. □
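
The last step of the proof relies on the bias term dominating the central limit term, since 2/(4+d) is smaller than 1/2 for every d ≥ 1. The Python sketch below simply evaluates both orders for the sample sizes and dimensions of the three simulations; it ignores the constants hidden in the $O_P$ notation, so the numbers are orders of magnitude only.

```python
def bias_order(m, d):
    """Order m^(-2/(4+d)) of the bias term when h_j is of order m^(-1/(4+d))."""
    return m ** (-2.0 / (4 + d))

def clt_order(m):
    """Order m^(-1/2) of the central limit term."""
    return m ** (-0.5)

# (sample size, dimension) pairs taken from the three simulations.
settings = [(50, 3), (50, 10), (100, 20)]
for m, d in settings:
    print(m, d, bias_order(m, d), clt_order(m))
```

The gap between the two orders widens with the dimension d, which reflects how slowly kernel estimates converge in high dimension.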
Proof of proposition 10. We prove this proposition for $k \ge 2$, i.e. in the case where $g^{(k-1)}$ is
not known. The initial case using the known density $g^{(0)} = g$ will be an immediate consequence
from the above. Moreover, going forward, to be more legible, we will use g (resp. $g_n$) in lieu of
$g^{(k-1)}$ (resp. $\check g_n^{(k-1)}$). We can therefore remark that we have $f(X_i) \ge \theta_n - y_n$, $g(Y_i) \ge \theta_n - y_n$ and
$g_b(b^\top Y_i) \ge \theta_n - y_n$, for all i and for all $b \in \mathbb{R}^d_*$, thanks to the uniform convergence of the kernel
estimators. Indeed, we have $f(X_i) = f(X_i) - f_n(X_i) + f_n(X_i) \ge -y_n + f_n(X_i)$, by definition of $y_n$,
and then $f(X_i) \ge -y_n + \theta_n$, by hypothesis on $f_n(X_i)$. This is also true for $g_n$ and $g_{b,n}$. This entails
su PbeM , I^Ulg^fj ~ ^^Sf - - m(b J x)dx\ -+ a.s. 

Indeed, we remark that |^? =1 {g#f} - - /{jjfi} - l}/ 6 (& T *)d*| 

_ \lyn [ 9b,n( bTx i) i} fb.n{b T X t ) | vn g b (b T Xj) -j f b (b T Xj) 

~ \ n ^i=l\ hn{h \ Xi ) X J /„(Xi) n^<=lA(6'Xi) A J /(X 4 ) 
_i_lyn gi,Q T X») 1 -j fb(b T Xi) r-f g b (b T x) -, -j j /lT„U„| 

<- |ly»» r 3t--"( feTx ') il /b,n(fe T X) lyn g b {b T X,) , -j f b (b T Xj) | 



+ 



lyn 3ft( feT ^) 1 1 fb(b T Xj) r ( g b (b T x) , -. f /,T \ j | 



Moreover, since / |{j^rfy — 1}/ii,(& t :e)|g?2; < 2, the law of large numbers enables us to derive: 
l^?_if{S§ - " " l}A(i» T x)<fa| -» a.,.. 

Moreover, l^xf^ - 1}^T " " 1>» 

/ 1 V n 1 r ffb,n(fr T ^) -i -j /h,n(& T Xj) r gi,(6 T Xj) 1 -j f b {b T Xi) \ 

- n^ilt/^'Xi) j /„(Xi) iM&'xo X J /(Xi) I 

„ n j I r go,n(b T X) _ 1 -1 h, n (b T X,) _ r g (b T X) _ 1 -1 A(b T Xj) 1 _ | gb,n(b T X)-/b,n(fe T V.) _ gi,(b T Xi)-/,,(6 T Xi) i 
dlU ^/b.nO'X) J /«(Xi) 'L/ i ,(f ) T X l ) J /(Xi) I "I /„(Xi) /(Xi) I 

^ |/(Xi)M/„(Xi)| {l/Wl-l^( &T ^) " 9b(b T X t )\ + \f(Xt) - f n {X % )\.\g b {b^ X t )\ 

+ \f(X t )\.\fUb T X t ) - f b (b T X t )\ + \f(Xi) - UX^.lf^X,)]}, 
through the introduction of terms g b f — gbf and ffb — ffb, 

< w^aT 1 ^ — 0p(l)-7^ — , as a result of the very definitions of 6 n and y n respectively, 

Vn-\yn Vn) Q 

Vn n 

— >■ 0, a.s. because, §3- — ?■ a.s., by hypothesis on n . 

n — - 

Consequently, ^? =1 |{f^§ " ^^f 1 - {ff^ - U^xfl 0, as it is a Cesaro mean. 
This enables us to conclude. Similarly, we prove limits (6) and (7) page 13. □
Proof of lemma 14

Lemma 14 For any $p \le d$, we have $f^{(p-1)}_{a_p} = f_{a_p}$ - see Huber's analytic method -, $g^{(p-1)}_{a_p} = g_{a_p}$ - see
Huber's synthetic method - and $g^{(p-1)}_{a_p} = g_{a_p}$ - see our algorithm.

Proof:

As it is equivalent to prove either our algorithm or Huber's, we will only develop here the proof for
our algorithm. Assuming, without any loss of generality, that the $a_i$, i = 1,..,p, are the vectors of the
canonical basis, since $g^{(p-1)}(x) = g(x) \frac{f_1(x_1)}{g_1(x_1)} \cdots \frac{f_{p-1}(x_{p-1})}{g_{p-1}(x_{p-1})}$ we derive immediately that $g^{(p-1)}_p = g_p$.
We remark that it is sufficient to operate a change in basis on the $a_i$ to obtain the general case. □
Proof of lemma 15

Lemma 15 If there exists p, $p \le d$, such that $K(g^{(p)}, f) = 0$, then the family of $(a_i)_{i=1,..,p}$ - derived
from the construction of $g^{(p)}$ - is free and orthogonal.

Proof:

Without any loss of generality, let us assume that p = 2 and that the ai are the vectors of 
the canonical basis. Using a reductio ad absurdum with the hypotheses a± = (1,0, ...,0) and 
that a 2 = (a, 0, 0), where a 6 M, we get g^(x) = g(x2, .., x d /xi)fi(xi) and / = g^ 2 '{x) = 
g(x 2 ,..,x d /x 1 )f 1 (x 1 ) JiT^Jal ) • Hence /(^2, .., x d /x x ) = g{x 2 , .., x d / Xl ) J^u^y 
It consequently implies that f aai (c(Xi) = \g^) a ai(^i) since 

1 = f f(x 2 ,..,X d /x 1 )dx 2 ...dx d = fg(x 2 , ..,X d /x 1 )dx 2 ...dx d JiT^ax,) = Ji^J^y 

Therefore, g^ = g^\ i.e. p = 1 which leads to a contradiction. Hence, the family is free. 
Moreover, using a reductio ad absurdum we get the orthogonality. Indeed, we have 
f f{x)dx — 1 7^ +oo = f n(aj +1 x, ...,ajx)h(ajx, ajx)dx. □ 
Proof of lemma 16

Lemma 16 We have 6 = {b e 9 | / (fjfyf^ffy - l)f(x)dx < oo}. 

We get the result since J { f$ g %% - l)/(x) dx = / (^gf^f 1 - f{x)) d x = 0. □ 
Proof of propositions llll In the same manner as in Proposition 3.4 of |BROKEZ] . we prove this 
proposition through lemma [JBJ □ 
Proof of propositions [4] and [91 Proposition H] comes immediately from proposition [10] page IT71 
and lemma [TT1 page [T71 Similarly, we prove proposition ED since both sup aee ||6„(a) — a k \\ and (3 n 
converge toward a k a.s. in the case where / is known - see also in Appendix C, where we carry out 
our algorithm in the case where / is known. □ 
Proof of theorem 1121 Using lemma 161 page [TBI and since, for any k, g^ = g^ k ~^ (k-i) i we prove 
this theorem by induction. □ 
Proof of theorems [JJ and [6J, We prove the theorem [1] by induction. First, by the very definition 
of the kernel estimator g n °^ = g n converges towards g. Moreover, the continuity of a i-> f a , n and 
a | — > ga,n and proposition H] imply that gn = gn^ffi converges towards g^\ Finally, since, for any 

9a,n 

k, g n = gn zw=7) ; we conclude similarly as for gii \ In a similar manner, we prove theorem [61 □ 
Proof of theorem 1131 

relationship ([81). We consider tyj = { ^J-i)^ 3 .^J.t x ^ — [ g J-i)\ 3 .(aTx) }' Since / and g are bounded, it 
is easy to prove that from a certain rank, we get, for any given x in M. d 

TZemark 9 First, based on what we stated earlier, for any given x and from a certain rank, there is a 
constant R > independent from n, such that max( ^u-i)] 1 -.(d T x) ' [gti- 1 ')] (a T x) ^ — ^ = R( x ) = 
Second, since a& is an M— estimator of a^, its convergence rate is Op(n -1 / 2 ). 

Thus using simple functions, we infer an upper and lower bound for f$. and for f a .. We therefore 
reach the following conclusion: 

< Opin- 1 / 2 ). (13) 

We finally obtain: 

lTjk f:j>."j _ ,./, f aj (aj*) I _ -prfc fajfajx) |n fc /a- j; fe T ^) [ff ~ 1} ] *j (<>} 

l ii i=l[g(i-i)] a - j (a j T :E ) ^j=l [g U-i) ]a ^ a J x) \ ii j=l [gU-D] a .(al x) \ 1L j=l ^-D] a .. (d^ x) f a .(al x) 1 I' 

Based on the relationship f[T3|) . the expression ^J-l) j [ jJ.-j ^ ^ j ;|~t~j ' tends towards 1 at a rate 



of Op(n l l 2 ) for all j. Consequently, n*? =1 w7^n~T^r \~ f rr~s ^ tends towards 1 at a rate of 

J [g\J '\iCj(ij '£) Jaj\ a j x ) 

P (n-V2). Thus from a certain rank, we get [nj =1 y£t^*\ x) ~^ =1 J^Jy. x) \ = F {n-^)O v {l) . 
In conclusion, we obtain \g( k \x) -</«(*) | = g(x)\U k =1 p^jff x) ~n* =1 Jll^ x) I < F (n~V 2 ). 
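Relationship (9). The relationship (8) of theorem 13 implies that $\left|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1\right| = O_P(n^{-1/2})$ because, for any given $x$, $g^{(k)}(x) \left|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1\right| = |\check g^{(k)}(x) - g^{(k)}(x)|$. Consequently, there exists a smooth function $C$ of $\mathbb{R}^d$ in $\mathbb{R}^+$ such that $\lim_{n \to \infty} n^{-1/2} C(x) = 0$ and $\left|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1\right| \le n^{-1/2} C(x)$, for any $x$.

We then have $\int |\check g^{(k)}(x) - g^{(k)}(x)| dx = \int g^{(k)}(x) \left|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1\right| dx \le \int g^{(k)}(x) C(x) n^{-1/2} dx$.

Moreover, $\sup_{x \in \mathbb{R}^d} |\check g^{(k)}(x) - g^{(k)}(x)| = \sup_{x \in \mathbb{R}^d} g^{(k)}(x) \left|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1\right| = \sup_{x \in \mathbb{R}^d} g^{(k)}(x) C(x) n^{-1/2} \to 0$ a.s., by theorem 12. This implies that $\sup_{x \in \mathbb{R}^d} g^{(k)}(x) C(x) < \infty$ a.s., i.e. $\sup_{x \in \mathbb{R}^d} C(x) < \infty$ a.s. since $g^{(k)}$ has been assumed to be positive and bounded - see remark 8.

Thus, $\int g^{(k)}(x) C(x) dx \le \sup C \cdot \int g^{(k)}(x) dx = \sup C < \infty$ since $g^{(k)}$ is a density. We can therefore conclude $\int |\check g^{(k)}(x) - g^{(k)}(x)| dx \le \sup C \cdot n^{-1/2} = O_P(n^{-1/2})$. □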
Relationship (10). We have

$K(\check g^{(k)}, f) - K(g^{(k)}, f) = \int f \left(\varphi\left(\frac{\check g^{(k)}}{f}\right) - \varphi\left(\frac{g^{(k)}}{f}\right)\right) dx \le \int f S \left|\frac{\check g^{(k)}}{f} - \frac{g^{(k)}}{f}\right| dx = S \int |\check g^{(k)} - g^{(k)}| dx$
with the line before last being derived from theorem [Til page fT5l and where $\varphi : x \mapsto x \ln(x) - x + 1$ is a convex function and where $S > 0$. We get the same expression as the one found in our proof of relationship (9); we then obtain $K(\check g^{(k)}, f) - K(g^{(k)}, f) \le O_P(n^{-1/2})$. Similarly, we get $K(g^{(k)}, f) - K(\check g^{(k)}, f) \le O_P(n^{-1/2})$. We can therefore conclude. □
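The role of the convex function $\varphi$ in relationship (10) can be illustrated numerically. The following is a minimal sketch, not the author's code: it assumes two one-dimensional Gaussian densities chosen purely for illustration, approximates $K(g, f) = \int f \varphi(g/f) dx$ by a midpoint rule, and compares the result with the closed-form relative entropy between two Gaussians. The function names, grid and parameters are hypothetical.

```python
import math

def phi(x):
    """Convex function phi(x) = x ln(x) - x + 1 (phi(0) = 1 by continuity)."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def relative_entropy(g, f, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of K(g, f) = int f * phi(g / f) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        fx = f(x)
        if fx > 0.0:
            total += fx * phi(g(x) / fx) * h
    return total

def gaussian_relative_entropy(mu_g, s_g, mu_f, s_f):
    """Closed form of K(N(mu_g, s_g^2), N(mu_f, s_f^2))."""
    return math.log(s_f / s_g) + (s_g ** 2 + (mu_g - mu_f) ** 2) / (2.0 * s_f ** 2) - 0.5

# Illustrative densities (assumptions, not taken from the article).
g = lambda x: normal_pdf(x, 0.5, 1.2)
f = lambda x: normal_pdf(x, 0.0, 1.0)

k_numeric = relative_entropy(g, f)
k_closed = gaussian_relative_entropy(0.5, 1.2, 0.0, 1.0)
print(k_numeric, k_closed)
```

Because both densities integrate to 1, the terms $-x + 1$ of $\varphi$ cancel after integration, which is why the two values agree up to the quadrature error.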
Proof of lemma 17

Lemma 17 We keep the notations introduced in Appendix B. It holds n = O(m^). 
Proof:

Let us first study Huber's case. Let $N$ be the random variable such that

$N = \sum_{j=1}^m 1_{\{f_m(X_j) > \theta_m, \, g(Y_j) > \theta_m\}}$. Since the events $\{f_m(X_j) > \theta_m\}$ and $\{g(Y_j) > \theta_m\}$ are independent from one another and since $\{g(Y_j) > \theta_m\} \subset \{g_m(Y_j) > -y_m + \theta_m\}$, we can say that

$n = m \cdot P(f_m(X_j) > \theta_m, \, g(Y_j) > \theta_m) \le m \cdot P(f_m(X_j) > \theta_m) \cdot P(g_m(Y_j) > -y_m + \theta_m)$.
Consequently, let us study $P(f_m(X_j) > \theta_m)$. Let $(\xi_i)_{i=1 \dots m}$ be the sequence such that, for any $i$
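and any $x$ in $\mathbb{R}^d$, $\xi_i(x) = \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l} e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} - \int \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l} e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx$. Hence, for any given $j$ and conditionally to $X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_m$, the variables $(\xi_i(X_j))_{i=1 \dots m}$ are i.i.d. and centered, have the same second moment, and are such that

$|\xi_i(X_j)| \le \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l} + \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l} \int f(x) dx = 2 (2\pi)^{-d/2} \prod_{l=1}^d h_l^{-1}$ since $\sup_x e^{-\frac{1}{2} x^2} \le 1$.

Moreover, noting that $f_m(x) = \frac{1}{m} \sum_{i=1}^m \xi_i(x) + (2\pi)^{-d/2} \frac{1}{m} \sum_{i=1}^m \prod_{l=1}^d h_l^{-1} \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx$,
we have $f_m(X_j) > \theta_m \Leftrightarrow \frac{1}{m} \sum_{i=1}^m \xi_i(X_j) + (2\pi)^{-d/2} \frac{1}{m} \sum_{i=1}^m \prod_{l=1}^d h_l^{-1} \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx > \theta_m$

$\Leftrightarrow \frac{1}{m-1} \sum_{i=1}^m \xi_i(X_j) > \left(\theta_m - (2\pi)^{-d/2} \frac{1}{m} \sum_{i=1}^m \prod_{l=1}^d h_l^{-1} \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx\right) \frac{m}{m-1}$
with $\xi_j(X_j) = 0$. Then, defining $t$ (resp. $\varepsilon$) as $t = 2 (2\pi)^{-d/2} \prod_{l=1}^d h_l^{-1}$ (resp.

$\varepsilon = \left(\theta_m - (2\pi)^{-d/2} \prod_{l=1}^d h_l^{-1} \frac{1}{m} \sum_{i=1}^m \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx\right) \frac{m}{m-1}$), Bennett's inequality - [DEVCY85]
page 160 - implies that $P\left(\frac{1}{m-1} \sum_{i=1}^m \xi_i(X_j) > \varepsilon \mid X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_m\right) \le 2 \exp\left(-\frac{(m-1) \varepsilon^2}{4 t^2}\right)$.

Finally, since the $X_j$ are i.i.d. and since $\int \left(\int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - y_l}{h_l}\right)^2} f(x) dx\right) f(y) dy < 1$, then the law of

large numbers implies that $\frac{1}{m} \sum_{i=1}^m \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - X_{il}}{h_l}\right)^2} f(x) dx \to_m \int \int \prod_{l=1}^d e^{-\frac{1}{2}\left(\frac{x_l - y_l}{h_l}\right)^2} f(x) f(y) dx dy$ a.s.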



Consequently, since < v < -^rg - see remark [JJ - and since e x < x 2 when x > 0, we obtain, 



(m-l)e 2 



4t 2 



0(m *), i.e., from a certain rank, 
) = 0(rn~ z ). In conclusion, we can 



after calculation, that, from a certain rank, exp(- 
P(f m ( Y j) > 8m) = 0(m"3). Similarly, we infer P(g(Y j ) > d r , 
say that n = m.P(f m (Xj) > 8 m ).P(g m (Yj) > 9 rn ) = 0(m?). Similarly, we derive the same result as 
above for any step of our method as well as Huber's. □ 
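The product Gaussian kernel estimate $f_m$ used in the proof of lemma 17, and the uniform bound $(2\pi)^{-d/2} \prod_l h_l^{-1}$ on each kernel term, can be sketched as follows. This is only an illustration under stated assumptions: the sample, the bandwidths and the threshold `theta` are hypothetical choices, not the values of $h_l$ and $\theta_m$ used in the article.

```python
import math
import random

def product_gaussian_kde(x, sample, bandwidths):
    """Kernel estimate f_m(x): average over the sample of a product of
    one-dimensional Gaussian kernels with bandwidths h_1, ..., h_d."""
    total = 0.0
    for point in sample:
        term = 1.0
        for x_l, p_l, h_l in zip(x, point, bandwidths):
            z = (x_l - p_l) / h_l
            term *= math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * h_l)
        total += term
    return total / len(sample)

def kernel_sup_bound(bandwidths):
    """Upper bound (2 pi)^(-d/2) * prod_l 1/h_l of one kernel term,
    obtained from sup_x exp(-x^2 / 2) <= 1."""
    d = len(bandwidths)
    return (2.0 * math.pi) ** (-d / 2.0) / math.prod(bandwidths)

random.seed(0)
d, m = 2, 200
sample = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
bandwidths = [m ** (-1.0 / (4 + d))] * d  # hypothetical bandwidth choice
theta = 0.01                              # hypothetical truncation threshold

bound = kernel_sup_bound(bandwidths)
values = [product_gaussian_kde(x, sample, bandwidths) for x in sample]
kept = sum(1 for v in values if v > theta)  # observations with f_m(X_j) > theta
print(bound, max(values), kept)
```

The count `kept` plays the role of the truncated sample size that the lemma compares with $m$; the sketch makes no claim about its rate.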
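Proof of theorems 2 and 7. First, from lemma 13 we derive that, for any $x$,

$\sup_{a \in \mathbb{R}^d_*} |\check f_{a,n}(a^T x) - f_a(a^T x)| = O_P(n^{-\frac{2}{4+d}})$. Then, let us consider

$\Psi_j = \frac{\check f_{\check a_j,n}(\check a_j^T x)}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^T x)} - \frac{f_{a_j}(a_j^T x)}{g^{(j-1)}_{a_j}(a_j^T x)}$,

we have $\Psi_j = \frac{1}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^T x) \, g^{(j-1)}_{a_j}(a_j^T x)} \left( \left(\check f_{\check a_j,n}(\check a_j^T x) - f_{a_j}(a_j^T x)\right) g^{(j-1)}_{a_j}(a_j^T x) + f_{a_j}(a_j^T x) \left(g^{(j-1)}_{a_j}(a_j^T x) - \check g^{(j-1)}_{\check a_j,n}(\check a_j^T x)\right) \right)$,

i.e. $|\Psi_j| = O_P(n^{-\frac{2}{4+d}})$ since $f_{a_j}(a_j^T x) = O(1)$ and $g^{(j-1)}_{a_j}(a_j^T x) = O(1)$. We can therefore conclude similarly as in theorem 13 and through lemma 17. Similarly, we derive theorem 7. □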
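The decomposition of $\Psj$-type differences of ratios used above rests on an elementary identity: for $a/b - c/d$, the error splits into a numerator part $(a - c)$ and a denominator part $(d - b)$. A minimal sketch checking this identity with exact rational arithmetic is given below; the function names and sample values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def ratio_gap(a, b, c, d):
    """Difference of two ratios, a/b - c/d, computed directly."""
    return a / b - c / d

def ratio_gap_decomposed(a, b, c, d):
    """Same difference, written as a numerator error (a - c) plus a
    denominator error (d - b), both divided by b * d."""
    return ((a - c) * d + c * (d - b)) / (b * d)

def ratio_gap_bound(a, b, c, d):
    """Triangle-inequality bound on |a/b - c/d| from the decomposition."""
    return (abs(a - c) * abs(d) + abs(c) * abs(d - b)) / abs(b * d)

# Hypothetical values: (a, b) stand for the estimated numerator and
# denominator, (c, d) for the true ones.
a, b = Fraction(21, 10), Fraction(39, 10)
c, d = Fraction(2), Fraction(4)

direct = ratio_gap(a, b, c, d)
split = ratio_gap_decomposed(a, b, c, d)
print(direct, split, ratio_gap_bound(a, b, c, d))
```

When both $|a - c|$ and $|d - b|$ are of order $n^{-2/(4+d)}$ and $c$, $d$, $1/b$ stay bounded, the bound returned by `ratio_gap_bound` is of the same order, which is the argument the proof relies on.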
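Proof of theorem 14. First of all, we remark that hypotheses (H'1) to (H'3) imply that $\check\gamma_n$ and $\check c_n(a_k)$ converge towards $a_k$ in probability. Hypothesis (H'4) enables us to differentiate under the integral sign; after calculation, we get

$\mathbb{P} \frac{\partial}{\partial b} M(a_k, a_k) = \mathbb{P} \frac{\partial}{\partial a} M(a_k, a_k) = 0$,

$\mathbb{P} \frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k) = \mathbb{P} \frac{\partial^2}{\partial b_j \partial a_i} M(a_k, a_k) = \int \varphi''\left(\frac{g f_{a_k}}{f g_{a_k}}\right) \frac{\partial}{\partial a_i} \frac{g f_{a_k}}{f g_{a_k}} \frac{\partial}{\partial b_j} \frac{g f_{a_k}}{f g_{a_k}} f dx$,

$\mathbb{P} \frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) = \int \varphi'\left(\frac{g f_{a_k}}{f g_{a_k}}\right) \frac{\partial^2}{\partial a_i \partial a_j} \frac{g f_{a_k}}{f g_{a_k}} f dx$,

$\mathbb{P} \frac{\partial^2}{\partial b_i \partial b_j} M(a_k, a_k) = - \int \varphi''\left(\frac{g f_{a_k}}{f g_{a_k}}\right) \frac{\partial}{\partial b_i} \frac{g f_{a_k}}{f g_{a_k}} \frac{\partial}{\partial b_j} \frac{g f_{a_k}}{f g_{a_k}} f dx$,

and consequently $\mathbb{P} \frac{\partial^2}{\partial b_i \partial b_j} M(a_k, a_k) = - \mathbb{P} \frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k) = - \mathbb{P} \frac{\partial^2}{\partial b_j \partial a_i} M(a_k, a_k)$, which implies,

$\frac{\partial^2}{\partial a_i \partial a_j} K\left(g \frac{f_{a_k}}{g_{a_k}}, f\right) = \mathbb{P} \frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) - \mathbb{P} \frac{\partial^2}{\partial b_i \partial b_j} M(a_k, a_k) = \mathbb{P} \frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + \mathbb{P} \frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k) = \mathbb{P} \frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + \mathbb{P} \frac{\partial^2}{\partial b_j \partial a_i} M(a_k, a_k)$.

The very definition of the estimators $\check\gamma_n$ and $\check c_n(a_k)$ implies that

$\mathbb{P}_n \frac{\partial}{\partial b} M(b, a) = 0$ and $\mathbb{P}_n \frac{\partial}{\partial a} M(b(a), a) = 0$,

i.e.

$\mathbb{P}_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n) = 0$ and $\mathbb{P}_n \frac{\partial}{\partial a} M(\check c_n(a_k), \check\gamma_n) + \mathbb{P}_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n) \frac{\partial}{\partial a} \check c_n(a_k) = 0$,

i.e.

$\mathbb{P}_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n) = 0$ (E0)

$\mathbb{P}_n \frac{\partial}{\partial a} M(\check c_n(a_k), \check\gamma_n) = 0$ (E1)

Under (H'5) and (H'6), and using a Taylor development of the (E0) (resp. (E1)) equation, we infer there exists $(\bar c_n, \bar\gamma_n)$ (resp. $(\tilde c_n, \tilde\gamma_n)$) on the interval $[(\check c_n(a_k), \check\gamma_n), (a_k, a_k)]$ such that

$- \mathbb{P}_n \frac{\partial}{\partial b} M(a_k, a_k) = \left[\left(\mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k)\right)^T + o_P(1), \left(\mathbb{P} \frac{\partial^2}{\partial a \partial b} M(a_k, a_k)\right)^T + o_P(1)\right] a_n$

(resp. $- \mathbb{P}_n \frac{\partial}{\partial a} M(a_k, a_k) = \left[\left(\mathbb{P} \frac{\partial^2}{\partial b \partial a} M(a_k, a_k)\right)^T + o_P(1), \left(\mathbb{P} \frac{\partial^2}{\partial a \partial a} M(a_k, a_k)\right)^T + o_P(1)\right] a_n$)

with $a_n = ((\check c_n(a_k) - a_k)^T, (\check\gamma_n - a_k)^T)$. Thus we get

$\sqrt{n} a_n = \sqrt{n} \begin{bmatrix} \mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) & \mathbb{P} \frac{\partial^2}{\partial a \partial b} M(a_k, a_k) \\ \mathbb{P} \frac{\partial^2}{\partial b \partial a} M(a_k, a_k) & \mathbb{P} \frac{\partial^2}{\partial a \partial a} M(a_k, a_k) \end{bmatrix}^{-1} \begin{bmatrix} - \mathbb{P}_n \frac{\partial}{\partial b} M(a_k, a_k) \\ - \mathbb{P}_n \frac{\partial}{\partial a} M(a_k, a_k) \end{bmatrix} + o_P(1)$

$= \sqrt{n} \left(\mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) \, \frac{\partial^2}{\partial a \partial a} K\left(g \frac{f_{a_k}}{g_{a_k}}, f\right)\right)^{-1} \begin{bmatrix} \mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) + \frac{\partial^2}{\partial a \partial a} K\left(g \frac{f_{a_k}}{g_{a_k}}, f\right) & \mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) \\ \mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) & \mathbb{P} \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) \end{bmatrix} \begin{bmatrix} - \mathbb{P}_n \frac{\partial}{\partial b} M(a_k, a_k) \\ - \mathbb{P}_n \frac{\partial}{\partial a} M(a_k, a_k) \end{bmatrix} + o_P(1)$.
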
Moreover, the central limit theorem implies: $\sqrt{n} \, \mathbb{P}_n \frac{\partial}{\partial b} M(a_k, a_k) \to \mathcal{N}_d\left(0, \mathbb{P} \left\| \frac{\partial}{\partial b} M(a_k, a_k) \right\|^2\right)$ and $\sqrt{n} \, \mathbb{P}_n \frac{\partial}{\partial a} M(a_k, a_k) \to \mathcal{N}_d\left(0, \mathbb{P} \left\| \frac{\partial}{\partial a} M(a_k, a_k) \right\|^2\right)$ in law, since $\mathbb{P} \frac{\partial}{\partial b} M(a_k, a_k) = \mathbb{P} \frac{\partial}{\partial a} M(a_k, a_k) = 0$, which leads us to the result. Finally, if $f$ is known, we similarly prove theorem [HI □
Proof of theorems 3 and 8. We get the theorem through proposition 10 and theorem 14. □
Proof of proposition 12. We consider $\psi$, $\psi_a$, $\psi^{(k)}$, $\psi_a^{(k)}$ the characteristic functions of densities $f$, $f_a$, $g^{(k)}$ and $[g^{(k)}]_a$. We have $|\psi(ta) - \psi^{(k-1)}(ta)| = |\psi_a(t) - \psi_a^{(k-1)}(t)| \le \int |f_a(a^T x) - [g^{(k-1)}]_a(a^T x)| dx$,
and then $\sup_a |\psi_a(t) - \psi_a^{(k-1)}(t)| \le \sup_a \int |f_a(a^T x) - [g^{(k-1)}]_a(a^T x)| dx \le \sup_a K([g^{(k-1)}]_a, f_a)$ since $\psi(ta) = E(e^{i t a^T X}) = \psi_a(t)$ - where $t \in \mathbb{R}$ and $a \in \mathbb{R}^d_*$ - and since the Kullback-Leibler divergence is greater than the $L^1$ distance. Therefore, since, as explained in section 14 of Huber's article, we have $\lim_k K([g^{(k-1)}]_{a_k}, f_{a_k}) = 0$, we then get $\lim_k g^{(k)} = f$ - which is Huber's representation of $f$. Moreover, we have $|\psi(t) - \psi^{(k)}(t)| \le \int |f(x) - g^{(k)}(x)| dx \le K(g^{(k)}, f)$.
As explained in section 14 of Huber's article and through remark 5 page 10 as well as through the additive relationship of proposition 5, we infer that $\lim_k K\left(g^{(k-1)} \frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f\right) = 0$. Consequently, we get $\lim_k g^{(k)} = f$ - which is our representation of $f$.
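The first inequality used in the proof of proposition 12, namely that the gap between two characteristic functions is bounded by the $L^1$ distance between the densities, can be checked numerically. The sketch below is an illustration only: it assumes two one-dimensional Gaussian densities with hypothetical parameters, uses their closed-form characteristic functions, and approximates the $L^1$ distance by a midpoint rule.

```python
import cmath
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cf(t, mu, sigma):
    """Characteristic function of N(mu, sigma^2): exp(i mu t - sigma^2 t^2 / 2)."""
    return cmath.exp(1j * mu * t - 0.5 * (sigma * t) ** 2)

def l1_distance(p, q, lo=-15.0, hi=15.0, steps=30000):
    """Midpoint-rule approximation of int |p(x) - q(x)| dx."""
    h = (hi - lo) / steps
    return sum(abs(p(lo + (i + 0.5) * h) - q(lo + (i + 0.5) * h)) for i in range(steps)) * h

# Illustrative densities (assumptions, not taken from the article).
f = lambda x: normal_pdf(x, 0.0, 1.0)
g = lambda x: normal_pdf(x, 0.5, 1.2)

l1 = l1_distance(f, g)
gaps = {t: abs(gaussian_cf(t, 0.0, 1.0) - gaussian_cf(t, 0.5, 1.2))
        for t in (0.1, 0.5, 1.0, 2.0, 5.0)}
for t, gap in gaps.items():
    print(t, gap, gap <= l1)
```

The bound follows from $|\int e^{itx}(f - g)(x) dx| \le \int |f - g|(x) dx$, so it holds for every $t$, not only for the few values sampled here.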

Proof of lemmas 1 and 2. We apply our algorithm between $f$ and $g$. There exists a sequence of densities $(g^{(k)})_k$ such that $0 = K(g^{(\infty)}, f) \le .. \le K(g^{(k)}, f) \le .. \le K(g, f)$, (*)

where $g^{(\infty)} = \lim_k g^{(k)}$ which is a density by construction. Moreover, let $(g_n^{(k)})_k$ be the sequence of densities such that $g_n^{(k)}$ is the kernel estimate of $g^{(k)}$. Since we derive from remark 8 page 19 an integrable upper bound of $g_n^{(k)}$, for all $k$, which is greater than $f$ - see also the definition of $\varphi$ in the proof of theorem 4 -, then the dominated convergence theorem implies that, for any $k$, $\lim_n K(g_n^{(k)}, f_n) = K(g^{(k)}, f)$, i.e., from a certain given rank $n_0$, we have

$0 \le .. \le K(g_n^{(\infty)}, f_n) \le .. \le K(g_n^{(k)}, f_n) \le .. \le K(g_n, f_n)$, (**)
Consequently, through lemma 18 page 25 there exists a $k$ such that

$0 \le .. \le K(\Psi_{n,k}^{(\infty)}, f_n) \le .. \le K(g_n^{(k)}, f_n) \le .. \le K(\Psi_{n,k-1}^{(\infty)}, f_n) \le .. \le K(g_n, f_n)$, (***)
where $\Psi_{n,k}^{(\infty)}$ is a density such that $\Psi_{n,k}^{(\infty)} = \lim_k g_n^{(k)}$. Finally, through the dominated convergence theorem and taking the limit as $n$ in (***), we get $0 = K(g^{(\infty)}, f) = \lim_n K(g_n^{(\infty)}, f_n) \ge \lim_n K(\Psi_{n,k}^{(\infty)}, f_n) \ge 0$. The dominated convergence theorem enables us to conclude:

$0 = \lim_n K(\Psi_{n,k}^{(\infty)}, f_n) = \lim_n \lim_k K(g_n^{(k)}, f_n)$. Similarly, we get lemma 2. □
Proof of lemma 18

Lemma 18 Keeping the notations of the proof of lemma 1, we have

$0 \le .. \le K(\Psi_{n,k}^{(\infty)}, f_n) \le .. \le K(g_n^{(k)}, f_n) \le .. \le K(\Psi_{n,k-1}^{(\infty)}, f_n) \le .. \le K(g_n, f_n)$, (***)

Proof:

First, as explained in section 4.2., we have $K(f^{(k)}, g) - K(f^{(k+1)}, g) = K(f^{(k)}_{a_{k+1}}, g_{a_{k+1}})$. Moreover, through remark 5 page 10 we also derive that $K(f^{(k)}, g) = K(g^{(k)}, f)$. Then, $K(f^{(k)}_{a_{k+1}}, g_{a_{k+1}})$ is the decreasing step of the relative entropies in (*) and leading to $0 = K(g^{(\infty)}, f)$. Similarly, the very construction of (**) implies that $K(f^{(k)}_{a_{k+1},n}, g_{a_{k+1},n})$ is the decreasing step of the relative entropies in (**) and leading to $K(g_n^{(\infty)}, f_n)$. Second, through the conclusion of section 4.2. and lemma 14.2 of Huber's article, we obtain that $K(f^{(k)}_{a_{k+1},n}, g_{a_{k+1},n})$ converges - decreasingly and in $k$ - towards a positive function of $n$ - that we will call $\xi_n$. Third, the convergence of $(g^{(k)})_k$ - see proposition 12 - implies that, for any given $n$, the sequence $(K(g_n^{(k)}, f_n))_k$ is not finite. Then, through relationship (**), there exists a $k$ such that $0 < K(g_n^{(k)}, f_n) - K(g_n^{(\infty)}, f_n) < \xi_n$.

Consequently, since $Q \mapsto K(Q, P)$ is l.s.c. - see property 3 - relationship (**) implies (***). □
Proof of theorems 4 and 9. We recall that $g_n^{(k)}$ is the kernel estimator of $g^{(k)}$. Since the Kullback-Leibler divergence is greater than the $L^1$-distance, we then have $\lim_n \lim_k K(g_n^{(k)}, f_n) \ge \lim_n \lim_k \int |g_n^{(k)}(x) - f_n(x)| dx$. Moreover, Fatou's lemma implies that

$\lim_k \int |g_n^{(k)}(x) - f_n(x)| dx \ge \int \lim_k [|g_n^{(k)}(x) - f_n(x)|] dx = \int |[\lim_k g_n^{(k)}(x)] - f_n(x)| dx$ and
$\lim_n \int |[\lim_k g_n^{(k)}(x)] - f_n(x)| dx \ge \int \lim_n [|[\lim_k g_n^{(k)}] - f_n|] dx = \int |[\lim_n \lim_k g_n^{(k)}(x)] - \lim_n f_n(x)| dx$.
We then obtain that $0 = \lim_n \lim_k K(g_n^{(k)}, f_n) \ge \int |\lim_n \lim_k g_n^{(k)}(x) - \lim_n f_n(x)| dx \ge 0$, i.e. that
$\int |\lim_n \lim_k g_n^{(k)}(x) - \lim_n f_n(x)| dx = 0$. Moreover, for any given $k$ and any given $n$, the function $g_n^{(k)}$ is a convex combination of multivariate Gaussian distributions. As derived at remark HI for all $k$, the determinant of the covariance of the random vector - with density $g^{(k)}$ - is greater than or equal to the product of a positive constant times the determinant of the covariance of the random vector with density $f$. The form of the kernel estimate therefore implies that there exists an integrable function $\varphi$ such that, for any given $k$ and any given $n$, we have $|g_n^{(k)}| \le \varphi$. Finally, the dominated convergence theorem enables us to say that $\lim_n \lim_k g_n^{(k)} = \lim_n f_n = f$, since $f_n$ converges towards $f$ and since $\int |\lim_n \lim_k g_n^{(k)}(x) - \lim_n f_n(x)| dx = 0$. Similarly, we get theorem 9. □
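Proof of theorem 15. Through a Taylor development of $\mathbb{P}_n M(\check c_n(a_k), \check\gamma_n)$ of rank 2, we get at point $(a_k, a_k)$:
$\mathbb{P}_n M(\check c_n(a_k), \check\gamma_n) = \mathbb{P}_n M(a_k, a_k) + \mathbb{P}_n \frac{\partial}{\partial a} M(a_k, a_k) (\check\gamma_n - a_k)^T + \mathbb{P}_n \frac{\partial}{\partial b} M(a_k, a_k) (\check c_n(a_k) - a_k)^T$
$+ \frac{1}{2} \{ (\check\gamma_n - a_k)^T \mathbb{P}_n \frac{\partial^2}{\partial a \partial a} M(a_k, a_k) (\check\gamma_n - a_k) + (\check c_n(a_k) - a_k)^T \mathbb{P}_n \frac{\partial^2}{\partial b \partial a} M(a_k, a_k) (\check\gamma_n - a_k)$
$+ (\check\gamma_n - a_k)^T \mathbb{P}_n \frac{\partial^2}{\partial a \partial b} M(a_k, a_k) (\check c_n(a_k) - a_k) + (\check c_n(a_k) - a_k)^T \mathbb{P}_n \frac{\partial^2}{\partial b \partial b} M(a_k, a_k) (\check c_n(a_k) - a_k) \}$
Thus, lemma HO implies $\mathbb{P}_n M(\check c_n(a_k), \check\gamma_n) = \mathbb{P}_n M(a_k, a_k) + O_P(\frac{1}{n})$,
i.e. $\sqrt{n} (\mathbb{P}_n M(\check c_n(a_k), \check\gamma_n) - \mathbb{P} M(a_k, a_k)) = \sqrt{n} (\mathbb{P}_n M(a_k, a_k) - \mathbb{P} M(a_k, a_k)) + o_P(1)$.
Hence $\sqrt{n} (\mathbb{P}_n M(\check c_n(a_k), \check\gamma_n) - \mathbb{P} M(a_k, a_k))$ abides by the same limit distribution as
$\sqrt{n} (\mathbb{P}_n M(a_k, a_k) - \mathbb{P} M(a_k, a_k))$, which is $\mathcal{N}(0, Var_{\mathbb{P}}(M(a_k, a_k)))$. □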
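The order bookkeeping behind the $O_P(\frac{1}{n})$ remainder can be spelled out as follows. This is a sketch under stated assumptions rather than the author's own derivation: it assumes that the estimators converge at rate $n^{-1/2}$, that the first derivatives of $M$ are centered at $(a_k, a_k)$ so that their empirical means are $O_P(n^{-1/2})$, and that the empirical second derivatives are $O_P(1)$. The shorthand $\partial_a$, $\partial_b$ is ours.

```latex
\begin{align*}
\mathbb{P}_n \partial_a M(a_k,a_k)\,(\check\gamma_n - a_k)^T
  &= O_P(n^{-1/2})\,O_P(n^{-1/2}) = O_P(n^{-1}),\\
\mathbb{P}_n \partial_b M(a_k,a_k)\,(\check c_n(a_k) - a_k)^T
  &= O_P(n^{-1/2})\,O_P(n^{-1/2}) = O_P(n^{-1}),\\
(\check\gamma_n - a_k)^T\,\mathbb{P}_n \partial_a\partial_a M(a_k,a_k)\,(\check\gamma_n - a_k)
  &= O_P(n^{-1/2})\,O_P(1)\,O_P(n^{-1/2}) = O_P(n^{-1}),\\
\sqrt{n}\,\bigl(\mathbb{P}_n M(\check c_n(a_k),\check\gamma_n) - \mathbb{P}_n M(a_k,a_k)\bigr)
  &= \sqrt{n}\,O_P(n^{-1}) = O_P(n^{-1/2}) = o_P(1).
\end{align*}
```

The remaining quadratic terms are handled exactly like the third line, which is why only the term $\sqrt{n} (\mathbb{P}_n M(a_k, a_k) - \mathbb{P} M(a_k, a_k))$ contributes to the limit distribution.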
Proof of theorems 5 and 10. Through proposition 10 and theorem 15, we derive theorem 5. Similarly, we get theorem 10. □